Chapter 4
Feedforward Neural Networks
4.1 Introduction
This chapter presents a detailed analysis of the pattern recognition
tasks that can be performed by an artificial neural network. As mentioned earlier, a feedforward artificial neural network
consists of layers of processing units, each layer feeding input to the
next layer in a feedforward manner through a set of connection
strengths or weights. The simplest such network is a two layer network.
By a suitable choice of architecture for a feedforward network, it
is possible to perform several pattern recognition tasks. The simplest
task is a pattern association task, which can be realized by a two
layer feedforward network with linear processing units. A detailed
analysis of the linear association network shows that the network is
limited in its capabilities. In particular, the number of input-output
pattern pairs to be associated are limited to the dimensionality of the
input pattern, and also the set of input patterns must be linearly
independent. The constraint on the number of input patterns is
overcome by using a two layer feedforward network with nonlinear
processing units in the output layer. This modification automatically
leads to the consideration of pattern classification problems. While
this modification overcomes the constraints of number and linear
independence on the input patterns, it introduces the limitations of
linear separability of the functional relation between input and output
patterns. Classification problems which are not linearly separable are
called hard problems. In order to overcome the constraint of linear
separability for pattern classification problems, a multilayer
feedforward network with nonlinear processing units in all the
intermediate hidden layers and in the output layer is proposed. While
a multilayer feedforward architecture could solve representation of
the hard problems in a network, it introduces the problem of hard
learning, the difficulty in adjusting the weights of the network
to capture the implied functional relationship between the given
input-output pattern pairs. The hard learning problem is solved by using the backpropagation learning algorithm. The resulting network
provides a solution to the pattern mapping problems. The generalization (ability to learn a mapping function) feature of a multilayer
feedforward network with the backpropagation learning law depends
on several factors such as the architectural details of the network,
the learning rate parameter of the training process and the training
samples themselves.
Table 4.1 shows the summary of the topics to be discussed in this
chapter. The pattern association problem is discussed in Section 4.2.
Table 4.1 Pattern Recognition Tasks by Feedforward Neural Networks

Pattern association
Architecture: Two layers, linear processing units, single set of weights
Learning: Hebb's (orthogonal) rule, Delta (linearly independent) rule
Recall: Direct
Limitation: Linear independence, number of patterns restricted to input dimensionality
To overcome: Nonlinear processing units, leads to a pattern classification problem

Pattern classification
Architecture: Two layers, nonlinear processing units, geometrical interpretation
Learning: Perceptron learning
Recall: Direct
Limitation: Linearly separable functions, cannot handle hard problems
To overcome: More layers, leads to a hard learning problem

Pattern mapping or classification
Architecture: Multilayer (hidden), nonlinear processing units, geometrical interpretation
Learning: Generalized delta rule (backpropagation)
Recall: Direct
Limitation: Slow learning, does not guarantee convergence
To overcome: More complex architecture
This section gives a detailed analysis of a linear associative network,
and shows the limitations of the network for pattern association
problems. Section 4.3 describes the pattern classification problem. An
analysis of a two layer feedforward network with nonlinear processing
units in the output layer brings out the limitations of the network
for pattern classification task. The section also discusses the problems
of classification, representation, learning and convergence in the
context of perceptron networks. In Section 4.4 the problem of pattern
mapping by a multilayer neural network is discussed. The chapter
concludes with a discussion on the backpropagation learning law and
its implications for generalization in a pattern mapping problem.
4.2 Analysis of Pattern Association Networks
4.2.1 Linear Associative Network
The objective in pattern association is to design a network that can represent the associations in the pairs of vectors $(\mathbf{a}_l, \mathbf{b}_l)$, $l = 1, 2, \ldots, L$, through a set of weights to be determined by a learning law. The given set of input-output pattern pairs is called training data. The input patterns are typically generated synthetically, like machine-printed characters. The input patterns used for recall may be corrupted by external noise.
The following vector and matrix notations are used for the analysis of a linear associative network:

$\mathbf{a} = (a_1, a_2, \ldots, a_M)^T$ : input vector
$\mathbf{x} = (x_1, x_2, \ldots, x_M)^T$ : activation vector of input layer
$\mathbf{y} = (y_1, y_2, \ldots, y_N)^T$ : activation vector of output layer
$\mathbf{b} = (b_1, b_2, \ldots, b_N)^T$ : output vector
$A = [\mathbf{a}_1 \; \mathbf{a}_2 \; \ldots \; \mathbf{a}_L]$ : input matrix ($M \times L$)
$B = [\mathbf{b}_1 \; \mathbf{b}_2 \; \ldots \; \mathbf{b}_L]$ : output matrix ($N \times L$)
$W = [w_{ji}]$ : weight matrix ($N \times M$)
$\mathbf{w}_j = (w_{j1}, w_{j2}, \ldots, w_{jM})^T$ : weight vector for the jth unit of the output layer
The network consists of a set of weights connecting two layers of processing units, as shown in Figure 4.1.

Figure 4.1 Linear associative network.

The output function of each unit in these layers is linear. Each output unit receives inputs from the M input units, corresponding to the M-dimensional input vectors. The number of output units corresponds to the dimensionality of the output vectors. Due to linearity of the output function, the activation values and the signal values of the units in the input layer are the same as the input data values. The activation value of the jth unit in the output layer is given by

$$ y_j = \sum_{i=1}^{M} w_{ji}\, a_i = \mathbf{w}_j^T \mathbf{a}, \quad j = 1, 2, \ldots, N. \qquad (4.1) $$
The output of the jth unit is the same as its activation value, since the output function of the unit is linear, i.e., $f(y_j) = y_j$. The network is called linear since the output of the network is simply a linear weighted sum of the component values of the input pattern.
The objective is to determine a set of weights in such a way
that the actual output is equal to the desired output for all the
given L pattern pairs. The weights are determined by using the
criterion that the total mean squared error between the desired
output and the actual output is to be minimized. The weights can be
determined either by computing them from the training data set or
by learning. Computation of weights makes use of all the training
set data together. On the other hand, in learning, the weights are
updated after presentation of each of the input-output pattern pairs
in the training set.
4.2.2 Determination of Weights by Computation
For a linear associative network [Hecht-Nielsen, 1990], the activation of the jth output unit is

$$ y_j = \sum_{i=1}^{M} w_{ji}\, x_i, \quad j = 1, 2, \ldots, N, \qquad (4.2) $$

and the actual output vector is

$$ \mathbf{b}' = \mathbf{y} = W\mathbf{x} = W\mathbf{a}. $$

Error in the output is given by the distance between the desired output vector and the actual output vector. The total error $E(W)$ over all the L input-output pattern pairs is given by

$$ E(W) = \sum_{l=1}^{L} \|\mathbf{b}_l - W\mathbf{a}_l\|^2. $$
We can write

$$ E(W) = \sum_{l=1}^{L} (\mathbf{b}_l - W\mathbf{a}_l)^T(\mathbf{b}_l - W\mathbf{a}_l), $$

where the square norm is $\|\mathbf{x}\|^2 = \mathbf{x}^T\mathbf{x}$. Using the definition that the trace of a square matrix S is the sum of the main diagonal entries of S, it is easy to see that

$$ E(W) = \mathrm{tr}(S), $$

where the matrix S is given by $S = (B - WA)(B - WA)^T$, and $\mathrm{tr}(S)$ is the trace of the matrix S. Using the definition for the pseudoinverse $A^+$ of a matrix A [Penrose, 1955], we get the matrix identities $AA^+A = A$, $A^+AA^+ = A^+$, $(AA^+)^T = AA^+$ and $(A^+A)^T = A^+A$. Using these matrix identities we get

$$ E(W) = \mathrm{tr}\big[(W - BA^+)\,AA^T\,(W - BA^+)^T\big] + \mathrm{tr}\big[B(I - A^+A)B^T\big]. \qquad (4.11) $$
It can be seen that the trace of the first term in Eq. (4.11) is always nonnegative, as it is in a quadratic form of the real symmetric matrix $AA^T$. It becomes zero for $W = BA^+$. The trace of the second term is a constant, independent of W. Since the trace of a sum of matrices is the sum of the traces of the individual matrices, the error is minimum when $W = BA^+$. The minimum error is obtained by substituting $W = BA^+$ in Eq. (4.11), and is given by

$$ E_{\min} = \mathrm{tr}\big[B(I - A^+A)B^T\big], \qquad (4.12) $$

where $I$ is an $L \times L$ identity matrix. The above simplification is obtained by using the following matrix identities: $(A^+A)^T = A^+A$ and $AA^+A = A$.
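As an illustrative check (the specific matrices below are made up, not from the text), the optimum choice $W = BA^+$ can be verified numerically. For a full-rank square input matrix A the pseudoinverse reduces to the ordinary inverse, and the recall error is zero:

```python
# Sketch: optimal associative weights W = B A^+ for a full-rank square A,
# where the pseudoinverse reduces to the ordinary inverse A^{-1}.

def mat_mul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def inv2(A):
    """Inverse of a 2x2 matrix (valid when the determinant is nonzero)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[ A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det,  A[0][0] / det]]

# Columns of A are the linearly independent input patterns a_1, a_2;
# columns of B are the desired output patterns b_1, b_2.
A = [[1.0, 1.0],
     [0.0, 1.0]]
B = [[2.0, 3.0],
     [1.0, 0.0]]

W = mat_mul(B, inv2(A))   # here W = B A^{-1} = B A^+
recall = mat_mul(W, A)    # applying all inputs should reproduce B exactly

print(recall)             # → [[2.0, 3.0], [1.0, 0.0]]
```

For rectangular or rank-deficient A the same check can be made with a general pseudoinverse routine; the recall error is then zero only when the input patterns are linearly independent.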
The following singular value decomposition (SVD) of an $M \times L$ matrix A is used to compute the pseudoinverse and to evaluate the minimum error. Assuming $L \le M$, the SVD of a matrix A is given by [1980]

$$ A = \sum_{i=1}^{L} \lambda_i^{1/2}\, \mathbf{p}_i \mathbf{q}_i^T, $$

where $AA^T\mathbf{p}_i = \lambda_i \mathbf{p}_i$, $A^TA\,\mathbf{q}_i = \lambda_i \mathbf{q}_i$, and the sets $\{\mathbf{p}_i\}$ and $\{\mathbf{q}_i\}$ are each orthogonal. The eigenvalues $\lambda_i$ of the matrices $AA^T$ and $A^TA$ will be real and nonnegative, since the matrices are symmetric. The eigenvalues are ordered as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_L$. Note that the $\mathbf{p}_i$ are M-dimensional vectors and the $\mathbf{q}_i$ are L-dimensional vectors. The pseudoinverse of A is given by

$$ A^+ = \sum_{i=1}^{r} \lambda_i^{-1/2}\, \mathbf{q}_i \mathbf{p}_i^T, $$

where r is the rank (maximum number of linearly independent columns) of A. Also, r turns out to be the number of nonzero eigenvalues $\lambda_i$. Note that if $r = L$, then all the L column vectors are linearly independent.
Using the SVD, it can be shown that $A^+A = I$, if L is the number of linearly independent columns of A. In such a case $I - A^+A = 0$ (null matrix), and hence $E_{\min} = 0$ (see Eq. (4.12)). Therefore, for a linearly independent set of input pattern vectors, the error in the recall of the associated output pattern vector is zero, if the optimum choice $W = BA^+$ is used.
If the rank r of the matrix A is less than L, then the input vectors are linearly dependent. In this case also the choice of the weight matrix as $W = BA^+$ still results in the least squared error. But this least error is not zero in this case. The value of the error depends on the rank of the matrix. The matrix $A^+A$ will have an $r \times r$ identity sub-matrix, and all the other four sub-matrices will be zero. That is,

$$ A^+A = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}. $$

The expression for minimum error is then given from Eq. (4.12) as

$$ E_{\min} = \mathrm{tr}\left[ B \begin{bmatrix} 0 & 0 \\ 0 & I_{L-r} \end{bmatrix} B^T \right]. $$
The issue is how to achieve the minimum error retrieval from the linear associative network when there is noise added to the input vectors. The noisy input vectors are $\mathbf{c}_l = \mathbf{a}_l + \mathbf{n}_l$. It is assumed that each component of the noise vector is uncorrelated with the other components and also with the components of the pattern vectors, and has the same standard deviation $\sigma$. Let C be an $M \times L$ matrix of the noisy input vectors. The objective is to find a W that minimizes the error in the recall of the desired outputs.
Murakami has shown that, if $W = BA^+$, then the expression for the error is as given in [Murakami and Aibara, 1987]. The first term in the square brackets of that expression can be attributed to the linear dependency part of the column vectors of A. If the rank of the matrix A is L, then $r = L$, and the first term will be zero. Therefore, the error is determined by the noise power $\sigma^2$. If, in addition, there is no noise, $\sigma^2 = 0$, then the error $E_{\min} = 0$. In that case error-free recall is possible.
To minimize the error in the presence of noise in the input pattern vectors, choose $W = B\hat{A}^+$, where

$$ \hat{A}^+ = \sum_{i=1}^{s} \lambda_i^{-1/2}\, \mathbf{q}_i \mathbf{p}_i^T. $$

The value of s is determined in such a way that all the retained eigenvalues $\lambda_i$, $i \le s$, exceed the noise power. That is, the noise power will decide how many terms should be considered in the expression for the pseudoinverse. If an eigenvalue is so small that the noise power dominates it, then the corresponding eigenvector could as well be included in the first term of the expression in Eq. (4.19) for the error corresponding to the linear dependence. This will reduce the error.
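A small sketch of this idea (hypothetical numbers, with NumPy assumed available): singular-value terms whose eigenvalues $\lambda_i = \sigma_i^2$ fall below an assumed noise power are simply dropped from the pseudoinverse.

```python
import numpy as np

def truncated_pinv(A, noise_power):
    """Pseudoinverse keeping only SVD terms whose eigenvalue
    lambda_i = sigma_i^2 exceeds the assumed noise power."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) Vt
    keep = s**2 > noise_power                         # retain dominant terms
    s_inv = np.where(keep, 1.0 / np.where(keep, s, 1.0), 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T

A = np.array([[3.0, 0.1],
              [0.0, 0.1],
              [1.0, 0.0]])        # M = 3, L = 2 input matrix

A_plus_full = truncated_pinv(A, noise_power=0.0)    # ordinary pseudoinverse
A_plus_trunc = truncated_pinv(A, noise_power=0.05)  # weak direction dropped

print(np.allclose(A_plus_full, np.linalg.pinv(A)))  # → True
```

With the threshold at 0.05 the second (weak) singular direction is suppressed, so the truncated pseudoinverse has rank 1; the lost term would otherwise amplify noise by $\lambda_2^{-1/2}$.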
Note that this analysis is valid only if the legal inputs are
corrupted by noise. It is not valid if the input consists of only noise.
The expression for the error is applicable for the closed set of the column vectors in A [Murakami and Aibara, 1987].
4.2.3 Determination of Weights by Learning
It is desirable to determine the weights of a network in an
incremental manner, as and when a new training input-output
pattern pair is available. This is called learning. Each update of the
weights with a new input data can be interpreted as network
learning. Computationally also learning is preferable because it does
not require information of all the training set data at the same time.
As will be seen later in this section, it is also preferable to have
learning confined to a local operation. That is, the update of a weight
connecting two processing units depends only on the connection
weight and the activations of the units on either side of the
connection. Two learning laws and their variations, as applicable to
a linear associative network, are discussed in this section.
Hebb's law: Let the input pattern vector $\mathbf{a}_l$ and the corresponding desired output pattern vector $\mathbf{b}_l$ be applied to the linear associative network. According to Hebb's law, the update of the weight value of a connection depends only on the activations of the processing units on either side of the connecting link. That is,

$$ \Delta w_{ji} = b_{lj}\, a_{li}. $$

Note that the computation of the increment $\Delta w_{ji} = b_{lj}\, a_{li}$ is purely local for the connection and the $l$th input-output pattern pair. The updated weight matrix for the application of the $l$th pair $(\mathbf{a}_l, \mathbf{b}_l)$ is given by

$$ W(l) = W(l-1) + \mathbf{b}_l\, \mathbf{a}_l^T, \qquad (4.23) $$

where $W(l-1)$ refers to the weight matrix after presentation of the first $(l-1)$ pattern pairs, and $W(l)$ refers to the weight matrix after presentation of the first $l$ pattern pairs. Note that $\mathbf{b}_l \mathbf{a}_l^T$ is the outer product of the two vectors, which results in an $N \times M$ matrix. Each element of this matrix is an increment of the corresponding element in the weight matrix.
If the initial values of the elements of the weight matrix are assumed to be zero, then the weight matrix resulting after application of the L input-output pattern vector pairs $(\mathbf{a}_l, \mathbf{b}_l)$, $l = 1, 2, \ldots, L$, is given by

$$ W = \sum_{l=1}^{L} \mathbf{b}_l\, \mathbf{a}_l^T = BA^T, \qquad (4.24) $$

where the $(j, i)$th element of W is given by $w_{ji} = \sum_{l=1}^{L} b_{lj}\, a_{li}$. To verify whether the network has learnt the association of the given set of input-output pattern vector pairs, apply the input pattern $\mathbf{a}_k$ and determine the actual output vector

$$ \mathbf{b}'_k = W\mathbf{a}_k = \mathbf{b}_k\,(\mathbf{a}_k^T \mathbf{a}_k) + \sum_{l \ne k} \mathbf{b}_l\,(\mathbf{a}_l^T \mathbf{a}_k). \qquad (4.26) $$
It is obvious from the above equation that the actual output $\mathbf{b}'_k$ is not the same as the desired output $\mathbf{b}_k$. Only if some restrictions are imposed on the set of input pattern vectors can we get the recall of the correct output pattern for the input pattern $\mathbf{a}_k$. The restriction is that the set of input vectors must be orthonormal. That is,

$$ \mathbf{a}_l^T \mathbf{a}_k = \begin{cases} 1, & l = k, \\ 0, & l \ne k. \end{cases} $$

In such a case the first term in the expression for $\mathbf{b}'_k$ in Eq. (4.26) becomes $\mathbf{b}_k$, and the second term becomes zero, as each of the products $\mathbf{a}_l^T \mathbf{a}_k$ is zero for $l \ne k$. The restriction of orthogonality limits the total number of the input patterns in the set A to M, the dimensionality of the input vectors, as there can be only M or fewer mutually orthogonal vectors in an M-dimensional space.
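A minimal sketch of the Hebbian weight construction $W = \sum_l \mathbf{b}_l \mathbf{a}_l^T$ and recall (the orthonormal patterns below are made up for illustration):

```python
# Hebb's law for a linear associative network: W = sum_l b_l a_l^T,
# starting from zero weights; recall is exact for orthonormal inputs.

def outer(b, a):
    """Outer product b a^T as a list-of-rows matrix."""
    return [[bj * ai for ai in a] for bj in b]

def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_vec(W, a):
    return [sum(wij * ai for wij, ai in zip(row, a)) for row in W]

# Two orthonormal 3-dimensional input patterns and their desired outputs.
inputs  = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
outputs = [[1.0, -1.0], [0.5, 2.0]]

W = [[0.0] * 3 for _ in range(2)]       # zero initial weights (N x M)
for a, b in zip(inputs, outputs):
    W = mat_add(W, outer(b, a))         # accumulate one outer product per pair

# Recall: the first term of Eq. (4.26) gives b_k; cross terms vanish.
print(mat_vec(W, inputs[0]))            # → [1.0, -1.0]
```

If the two input patterns were merely linearly independent rather than orthonormal, the cross terms $\mathbf{a}_l^T\mathbf{a}_k$ would not vanish and the recall would carry the error term discussed next.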
If the restriction of orthogonality on the set of input vectors is relaxed to mere linear independence, then the expression in Eq. (4.26) for recall reduces to

$$ \mathbf{b}'_k = \mathbf{b}_k + \mathbf{e}, \quad \text{where } \mathbf{e} = \sum_{l \ne k} \mathbf{b}_l\,(\mathbf{a}_l^T \mathbf{a}_k), $$

and it is assumed that the vectors are of unit magnitude, so that $\mathbf{a}_k^T \mathbf{a}_k = 1$. This leaves an error term $\mathbf{e}$, indicating that the recall is not perfect if the weight matrix is derived using Hebb's law.
However, it was shown in the previous Section 4.2.2 that, for a linearly independent set of input vectors, exact recall can be achieved if the weight matrix W is chosen as $W = BA^+$, where $A^+$ is the pseudoinverse of the matrix A. If the set of input vectors is not linearly independent, then still the best choice of W is $BA^+$, as this yields, on the average, the least squared error in the recall of the associated pattern. The error is defined as the difference between the desired and the actual output patterns from the associative network. If noise is added to the input, the best choice of the weight matrix is $W = B\hat{A}^+$, where $\hat{A}^+$ includes fewer terms (s) in the singular value decomposition expansion of A than the rank of the matrix, the choice of s being dictated by the level of the noise (see Eq. (4.21)).
For all these best choices of W, the weight values have to be computed from the knowledge of the complete input pattern matrix A, since all of them need the SVD of A to compute the pseudoinverse. However, it is possible, at least in some cases, to develop learning algorithms which can approach the best choices for the weight matrices. The purpose of these learning algorithms is to provide a procedure for incremental update of the weight matrix when an input-output pattern pair is presented to the network. Most of these learning algorithms are based on gradient descent along an error surface (see Appendix). The most basic among them is the least mean square (LMS) algorithm [Widrow and Hoff, 1960]. The gradient descent algorithms are discussed in detail later in the section on pattern mapping tasks.
Widrow's law: A form of learning can be used to obtain $W = BA^+$ recursively. Let $W(l-1)$ be the weight matrix after presentation of $(l-1)$ samples. Then $W(l-1) = B(l-1)\,A^+(l-1)$, where the matrices $A(l-1)$ and $B(l-1)$ are composed of the first $(l-1)$ vectors of A and the first $(l-1)$ vectors of B, respectively. When the pair $(\mathbf{a}_l, \mathbf{b}_l)$ is given to the network, then the updated weight matrix is given by (see [Hecht-Nielsen, 1990])

$$ W(l) = W(l-1) + \big[\mathbf{b}_l - W(l-1)\,\mathbf{a}_l\big]\,\mathbf{c}_l^T, \qquad (4.29) $$

where

$$ \mathbf{c}_l = \frac{[I - A(l-1)A^+(l-1)]\,\mathbf{a}_l}{\big\|[I - A(l-1)A^+(l-1)]\,\mathbf{a}_l\big\|^2}, \quad \text{if the denominator is nonzero}, $$

and otherwise

$$ \mathbf{c}_l = \frac{[A^+(l-1)]^T A^+(l-1)\,\mathbf{a}_l}{1 + \|A^+(l-1)\,\mathbf{a}_l\|^2}. $$

By starting with zero initial values for all the weights, and successively adding the pairs, we can obtain the final pseudoinverse-based weight matrix $W = BA^+$. The problem with this approach is that the recursive procedure cannot be implemented locally because of the need to calculate $\mathbf{c}_l$ in Eq. (4.29).
The same eventual effect can be approximately realized using the following variation of the above learning law,

$$ W(l) = W(l-1) + \eta\,\big[\mathbf{b}_l - W(l-1)\,\mathbf{a}_l\big]\,\mathbf{a}_l^T, $$

where $\eta$ is a small positive constant called the learning rate parameter. This learning law can be implemented locally by means of the following equation,

$$ \mathbf{w}_j(l) = \mathbf{w}_j(l-1) + \eta\,\big[b_{lj} - \mathbf{w}_j^T(l-1)\,\mathbf{a}_l\big]\,\mathbf{a}_l, \qquad (4.32) $$

where $\mathbf{w}_j(l-1)$ is the weight vector associated with the jth processing unit in the output layer of the linear associative network at the $(l-1)$th iteration. With this scheme, it is necessary to apply the pairs of the training set data several times, with each pair chosen at random.

The convergence of the learning law in Eq. (4.32) depends on the choice of the learning rate parameter $\eta$. For sufficiently low values of $\eta$, a linear associative network can adaptively form only an approximation to the desired weight matrix $W = BA^+$. There is no known method for adaptively learning the best choice of the weight matrix $W = BA^+$. Note also that no method is known to adaptively learn even an approximation to the best choice of the weight matrix $W = B\hat{A}^+$ in the case of additive noise in the input pattern vectors. Therefore, in the case of noisy patterns, the best weight matrix has to be computed using the expressions for $\hat{A}^+$ in terms of the components of the singular value decomposition of A, depending on the estimated noise level in the input patterns. This is obvious from the fact that noise effects can be reduced only when the noise statistics are observed over several patterns.
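The local update of Eq. (4.32) can be sketched as follows (the patterns and learning rate are made up for illustration); repeated random presentations drive the weights toward the pseudoinverse solution:

```python
# Incremental (delta-rule) learning for a linear associative network:
# w_j(l) = w_j(l-1) + eta * (b_lj - w_j^T a_l) * a_l   (Eq. 4.32)

import random

random.seed(0)

# Linearly independent 2-D input patterns and their desired 1-D outputs.
pairs = [([1.0, 0.2], [0.7]),
         ([0.1, 1.0], [-0.3])]

eta = 0.1                        # learning rate parameter
w = [[0.0, 0.0]]                 # N x M weight matrix, zero initial values

for _ in range(2000):            # many random presentations of the pairs
    a, b = random.choice(pairs)
    for j in range(len(w)):      # purely local update per output unit
        y = sum(wi * ai for wi, ai in zip(w[j], a))
        err = b[j] - y
        w[j] = [wi + eta * err * ai for wi, ai in zip(w[j], a)]

# After training, recall closely approximates the desired outputs.
recall = [sum(wi * ai for wi, ai in zip(w[0], a)) for a, _ in pairs]
print(recall)
```

Because the two input patterns here are linearly independent, the error can be driven arbitrarily close to zero; for a dependent or noisy set the update would only approach the least-squared-error solution.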
4.2.4 Discussion on Pattern Association Problem
Table 4.2 gives a summary of the results of linear associative networks.
Table 4.2 Summary of Results of Linear Associative Networks
Pattern association
Given a set $\{(\mathbf{a}_l, \mathbf{b}_l)\}$ of L pattern pairs, the objective is to determine the weights of a linear associative network so as to minimize the error between the desired and actual outputs. If $A = [\mathbf{a}_1 \; \ldots \; \mathbf{a}_L]$, $B = [\mathbf{b}_1 \; \ldots \; \mathbf{b}_L]$ and W are the input, output and weight matrices, respectively, then the optimum weights are given by
(a) $W = BA^T$ for orthogonal set of input vectors
(b) $W = BA^{-1}$ for linearly independent set of input vectors (full rank square matrix: $r = L = M$)
(c) $W = BA^+$ for linearly independent set of input vectors (full rank matrix: $r = L$)
(d) $W = BA^+$ for linearly dependent set of input vectors (reduced rank: $r < L$)
(e) $W = B\hat{A}^+$ for noisy input vectors
For the cases (a), (b) and (c), the minimum error is zero. For the case (d), the minimum error is determined by the rank of the input matrix. For the case (e), the minimum error is determined by both the rank of the input matrix and the noise power.

Determination of weights by learning
(a) For orthogonal input vectors the optimum weights $W = BA^T$ can be obtained using Hebb's learning law.
(b) For linearly independent or dependent input vectors an approximation to the optimum weights $W = BA^+$ can be learnt using a form of Widrow's learning law.
(c) For noisy input vectors there is no known learning law that can provide even an approximation to the optimum weights $W = B\hat{A}^+$.
It is useful to allow the processing units in the output layer of the network to have a bias input. In such a case each input vector to this layer is augmented with an additional component, whose value is always -1. Addition of this bias term results in a weight matrix W that performs an affine transformation. With the affine transformation, any arbitrary rotation, scaling and translation operation on patterns can be handled, whereas the linear transformations of the previous associative network can carry out only arbitrary rotation and scaling operations on the input patterns [Hecht-Nielsen, 1990].
In many applications the linkage between the dimensionality of the input data and the number of data items that can be associated and recalled is an unacceptable restriction. By means of coding schemes, the dimensionality of the input data can sometimes be increased artificially, thus allowing more pairs of items to be associated [1989].
But, as we will see in the next section, the dependence of the
number of input patterns on the dimensionality of the pattern vector
can be removed completely by using nonlinear processing units in the
output layer. Thus the artificial neural networks can capture the
association among the pairs $(\mathbf{a}_l, \mathbf{b}_l)$, $l = 1, 2, \ldots, L$, even when the number of input patterns is greater than the dimensionality of the input vectors, i.e., $L > M$. While the constraint of dimensionality on
the number of input patterns is removed in the artificial neural
networks, some other restrictions will be placed which involve the
functional relation between an input and the corresponding output.
In particular, the implied mapping between the input and output
pattern pairs can be captured by a two layer artificial neural network,
provided the mapping function belongs to a linearly separable class.
But the number of linearly separable functions decreases rapidly as
the dimensionality of the input and output pattern vectors increases.
These issues will be discussed in the following section.
4.3 Analysis of Pattern Classification Networks
In an M-dimensional space, if a set of points could be considered as input patterns, and if an output pattern, not necessarily distinct for each input, is assigned to each of the input patterns, then the
number of distinct output patterns can be viewed as distinct classes
or class labels for the input patterns. There is no restriction on the
number of input patterns. The input-output pattern vector pairs $(\mathbf{a}_l, \mathbf{b}_l)$, $l = 1, 2, \ldots, L$, in this case can be considered as a training
set for a pattern classification problem. Typically, for pattern
classification problems, the output patterns are points in a discrete
(normally binary) N-dimensional space. The input patterns are
usually from natural sources like speech and hand-printed characters.
The input patterns may be corrupted by external noise. Even a noisy
input will be mapped onto one of the distinct pattern classes, and
hence the recall displays an accretive behaviour.
4.3.1 Pattern Classification Perceptron
A two layer network with nonlinear (hard-limiting) output
functions for the units in the output layer can be used to perform the
task of pattern classification. The number of units in the input layer
corresponds to the dimensionality of the input pattern vectors. The units
in the input layer are all linear, as the input layer merely contributes
to fan out the input to each of the output units. The number of output
units depends on the number of distinct classes in the pattern
classification task. We assume for this discussion that the output units
are binary. Each output unit is connected to all the input units, and a
weight is associated with each connection. Since the output function of
a unit is a hard-limiting threshold function, for a given set of input-
output patterns, the weighted sum of the input values is compared with
the threshold for the unit to determine whether the sum is greater or
less than the threshold. Thus in this case a set of inequalities are
generated with the given data. Thus there is no unique solution for the
weights in this case, as in the case of linear associative network. It is
necessary to determine a set of weights that satisfies all the inequalities. Determination of such weights is usually accomplished by means of incremental adjustment of the weights using a learning law.
A detailed analysis of pattern classification networks is presented here assuming M input units and a single binary output unit. The output unit uses a hard-limiting threshold function to decide whether the output signal should be 1 or 0: if the weighted sum of the input values to the output unit exceeds the threshold, the output signal is labelled as 1, otherwise as 0. Extension of the analysis for a network consisting of multiple binary units in the output layer is trivial [1992]. Multiple binary output units are needed if the number of pattern classes exceeds 2.
Pattern classification problem: If a subset of the input patterns belongs to one class (say class $A_1$) and the remaining subset of the input patterns to another class (say class $A_2$), then the objective in a pattern classification problem is to determine a set of weights $w_1, w_2, \ldots, w_M$ such that if the weighted sum

$$ \sum_{i=1}^{M} w_i\, a_i > \theta, \quad \text{then } \mathbf{a} = (a_1, a_2, \ldots, a_M)^T \text{ belongs to class } A_1, \qquad (4.33) $$

and if

$$ \sum_{i=1}^{M} w_i\, a_i \le \theta, \quad \text{then } \mathbf{a} = (a_1, a_2, \ldots, a_M)^T \text{ belongs to class } A_2. \qquad (4.34) $$

Note that the dividing surface between the two classes is given by

$$ \sum_{i=1}^{M} w_i\, a_i = \theta. $$

This equation represents a linear hyperplane in the M-dimensional space. The hyperplane becomes a point if M = 1, a straight line if M = 2, and a plane if M = 3.
Since the solution of the classification problem involves determining the weights and the threshold value, the classification network can be depicted as shown in Figure 4.2, where the input to the connection carrying the threshold value ($w_0 = \theta$) is always -1.

Figure 4.2 A single unit pattern classification network (perceptron).

Defining the augmented input and weight vectors as $\mathbf{a} = (-1, a_1, \ldots, a_M)^T$ and $\mathbf{w} = (w_0, w_1, \ldots, w_M)^T$, respectively, the perceptron classification problem can be stated as follows: If $\mathbf{w}^T\mathbf{a} > 0$, then $\mathbf{a}$ belongs to class $A_1$, and if $\mathbf{w}^T\mathbf{a} \le 0$, then $\mathbf{a}$ belongs to class $A_2$. The equation for the dividing linear hyperplane is $\mathbf{w}^T\mathbf{a} = 0$.
Perceptron learning law: In the above perceptron classification problem, the input space is an M-dimensional space and the number of output patterns is two, corresponding to the two classes. Note that we use the $(M+1)$-dimensional augmented vector $\mathbf{a}$ to denote a point in the M-dimensional space, as the 0th component of the vector is always -1. Suppose the subsets $A_1$ and $A_2$ of points in the M-dimensional space contain the sample patterns belonging to the classes $A_1$ and $A_2$, respectively. The objective in the perceptron learning is to systematically adjust the weights for each presentation of an input vector belonging to $A_1$ or $A_2$ along with its class identification. The perceptron learning law for the two-class problem may be stated as follows:

$$ \mathbf{w}(m+1) = \mathbf{w}(m) + \eta\,\mathbf{a}, \quad \text{if } \mathbf{a} \in A_1 \text{ and } \mathbf{w}^T(m)\,\mathbf{a} \le 0, $$
$$ \mathbf{w}(m+1) = \mathbf{w}(m) - \eta\,\mathbf{a}, \quad \text{if } \mathbf{a} \in A_2 \text{ and } \mathbf{w}^T(m)\,\mathbf{a} > 0, \qquad (4.36) $$
where the index m is used to denote the learning process at the mth step. The vectors $\mathbf{a}$ and $\mathbf{w}(m)$ are the input and weight vectors, respectively, at the mth step, and $\eta$ is a positive learning rate parameter. $\eta$ can be varying at each learning step, although it is assumed constant in the perceptron learning. Note that no adjustment of weights is made when the input vector is correctly classified. That is,

$$ \mathbf{w}(m+1) = \mathbf{w}(m), \quad \text{if } \mathbf{a} \in A_1 \text{ and } \mathbf{w}^T(m)\,\mathbf{a} > 0, $$
$$ \mathbf{w}(m+1) = \mathbf{w}(m), \quad \text{if } \mathbf{a} \in A_2 \text{ and } \mathbf{w}^T(m)\,\mathbf{a} \le 0. \qquad (4.37) $$
The initial value of the weight vector could be random.
Figure 4.3 shows an example of the decision boundaries at different times for a 2-dimensional input vector.

Figure 4.3 Illustration of decision boundaries formed during implementation of perceptron learning for linearly separable classes.

The equation of the straight line is given by

$$ w_1 a_1 + w_2 a_2 = \theta. \qquad (4.38) $$
For different values of the weights during learning, the position
of the line changes. Note that in this example the two classes can be
separated by a straight line, and hence they are called linearly
separable classes. On the other hand consider the example of the
pattern classification problem in Figure 4.4. In this case the straight
line wanders in the plane during learning, and the weights do not
converge to a final stable value, as the two classes cannot be
separated by a single straight line.
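The learning law of Eq. (4.36) can be sketched on a small, made-up linearly separable problem (augmented inputs with a leading -1 component; the data and learning rate are illustrative only):

```python
# Perceptron learning law (Eq. 4.36) on a linearly separable two-class
# problem; each input is augmented with -1 so that w0 plays the threshold.

eta = 1.0
w = [0.0, 0.0, 0.0]                     # augmented weight vector (w0 = theta)

# (augmented input, class label): +1 for class A1, -1 for class A2
data = [([-1.0, 2.0, 2.0], +1),
        ([-1.0, 1.5, 2.5], +1),
        ([-1.0, 0.0, 0.5], -1),
        ([-1.0, 0.5, 0.0], -1)]

def classify(w, a):
    s = sum(wi * ai for wi, ai in zip(w, a))
    return +1 if s > 0 else -1

for _ in range(100):                    # enough passes for convergence
    for a, label in data:
        if classify(w, a) != label:     # adjust only on misclassification
            w = [wi + eta * label * ai for wi, ai in zip(w, a)]

print(all(classify(w, a) == label for a, label in data))   # → True
```

Because the two classes here are linearly separable, the convergence theorem below guarantees that the loop stops updating after a finite number of corrections; on an inseparable set (as in Figure 4.4) the boundary would keep wandering.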
Perceptron convergence theorem: This theorem states that the
perceptron learning law converges to a final set of weight values in
a finite number of steps, if the classes are linearly separable. The
proof of this theorem is as follows:
Figure 4.4 Illustration of decision boundaries formed during implementation of perceptron learning for linearly inseparable classes.
Let $\mathbf{a}$ and $\mathbf{w}$ be the augmented input and weight vectors, respectively. Assuming that there exists a solution $\mathbf{w}^*$ for the classification problem, we have to show that $\mathbf{w}^*$ can be approached in a finite number of steps, starting from some initial random weight values. We know that the solution $\mathbf{w}^*$ satisfies the following inequality as per Eq. (4.37):

$$ \mathbf{w}^{*T}\mathbf{a} > \alpha > 0, \quad \text{for each } \mathbf{a} \in A_1, \qquad (4.39) $$

where $\alpha = \min_{\mathbf{a} \in A_1} \mathbf{w}^{*T}\mathbf{a}$. The weight vector is updated if $\mathbf{w}^T(m)\,\mathbf{a} \le 0$, for $\mathbf{a} \in A_1$. That is,

$$ \mathbf{w}(m+1) = \mathbf{w}(m) + \eta\,\mathbf{a}(m), \quad \text{for } \mathbf{w}^T(m)\,\mathbf{a}(m) \le 0, \qquad (4.40) $$

where $\mathbf{a}(m)$ is used to denote the input vector at step m. If we start with $\mathbf{w}(1) = \mathbf{0}$, where $\mathbf{0}$ is an all-zero column vector, then

$$ \mathbf{w}(m+1) = \eta\,[\mathbf{a}(1) + \mathbf{a}(2) + \cdots + \mathbf{a}(m)]. \qquad (4.41) $$

Multiplying both sides of Eq. (4.41) by $\mathbf{w}^{*T}$, we get

$$ \mathbf{w}^{*T}\mathbf{w}(m+1) > \eta m \alpha, \qquad (4.42) $$

since $\mathbf{w}^{*T}\mathbf{a} > \alpha$ according to Eq. (4.39). Using the Cauchy-Schwartz inequality

$$ \|\mathbf{w}^*\|^2\,\|\mathbf{w}(m+1)\|^2 \ge \big[\mathbf{w}^{*T}\mathbf{w}(m+1)\big]^2, \qquad (4.43) $$

we get from Eq. (4.42)

$$ \|\mathbf{w}(m+1)\|^2 > \frac{m^2\eta^2\alpha^2}{\|\mathbf{w}^*\|^2}. \qquad (4.44) $$

We also have from Eq. (4.40)

$$ \|\mathbf{w}(m+1)\|^2 = \|\mathbf{w}(m)\|^2 + 2\eta\,\mathbf{w}^T(m)\,\mathbf{a}(m) + \eta^2\|\mathbf{a}(m)\|^2 \le \|\mathbf{w}(m)\|^2 + \eta^2\|\mathbf{a}(m)\|^2, \qquad (4.45) $$

since for learning $\mathbf{w}^T(m)\,\mathbf{a}(m) \le 0$ when $\mathbf{a}(m) \in A_1$. Therefore, starting from $\mathbf{w}(1) = \mathbf{0}$, we get from Eq. (4.45)

$$ \|\mathbf{w}(m+1)\|^2 \le m\,\eta^2\beta, \qquad (4.46) $$

where $\beta = \max_{\mathbf{a}} \|\mathbf{a}\|^2$. Combining Eqs. (4.44) and (4.46), we obtain the optimum value of m by solving

$$ \frac{m^2\eta^2\alpha^2}{\|\mathbf{w}^*\|^2} = m\,\eta^2\beta, \qquad (4.47) $$

which gives

$$ m = \frac{\beta\,\|\mathbf{w}^*\|^2}{\alpha^2}. \qquad (4.48) $$

Since $\alpha$ is positive, Eq. (4.48) shows that the optimum weight value can be approached in a finite number of steps using the perceptron learning law.
Alternate proof of the convergence theorem: Assume that a solution vector $\mathbf{w}^*$ exists. Then, using the following perceptron behaviour

$$ \mathbf{w}^{*T}\mathbf{a} > 0, \quad \text{for } \mathbf{a} \in A_1, \qquad (4.49) $$

and

$$ \mathbf{w}^{*T}\mathbf{a} \le 0, \quad \text{for } \mathbf{a} \in A_2, \qquad (4.50) $$

we can show that the magnitude of the cosine of the angle $\theta$ between the weight vectors $\mathbf{w}^*$ and $\mathbf{w}(m)$ is given by

$$ |\cos\theta| = \frac{|\mathbf{w}^{*T}\mathbf{w}(m)|}{\|\mathbf{w}^*\|\,\|\mathbf{w}(m)\|}, \qquad (4.51) $$

where

$$ \alpha = \min_{\mathbf{a}} |\mathbf{w}^{*T}\mathbf{a}| \qquad (4.52) $$

and

$$ \beta = \max_{\mathbf{a}} \|\mathbf{a}\|^2. \qquad (4.53) $$

Using the perceptron learning law in Eq. (4.36) and Eq. (4.49), we get

$$ \mathbf{w}^{*T}\mathbf{w}(m+1) = \mathbf{w}^{*T}\mathbf{w}(m) + \eta\,\mathbf{w}^{*T}\mathbf{a}(m) \ge \mathbf{w}^{*T}\mathbf{w}(m) + \eta\alpha, \quad \text{for } \mathbf{w}^T(m)\,\mathbf{a}(m) \le 0 \text{ and } \mathbf{a}(m) \in A_1. \qquad (4.54) $$

Starting from $\mathbf{w}(0) = \mathbf{0}$, we get

$$ \mathbf{w}^{*T}\mathbf{w}(m) \ge m\eta\alpha, \quad \text{for } \mathbf{w}^T(m)\,\mathbf{a}(m) \le 0 \text{ and } \mathbf{a}(m) \in A_1. \qquad (4.55) $$

Likewise, using the perceptron learning law in Eq. (4.36) and Eq. (4.50), and noting that $\mathbf{w}^{*T}\mathbf{a}(m) \le -\alpha$ for $\mathbf{a}(m) \in A_2$, we get

$$ \mathbf{w}^{*T}\mathbf{w}(m+1) = \mathbf{w}^{*T}\mathbf{w}(m) - \eta\,\mathbf{w}^{*T}\mathbf{a}(m) \ge \mathbf{w}^{*T}\mathbf{w}(m) + \eta\alpha, \quad \text{for } \mathbf{w}^T(m)\,\mathbf{a}(m) > 0 \text{ and } \mathbf{a}(m) \in A_2. \qquad (4.56) $$

Starting from $\mathbf{w}(0) = \mathbf{0}$, both cases give

$$ \mathbf{w}^{*T}\mathbf{w}(m) \ge m\eta\alpha. \qquad (4.57) $$

Therefore, from Eqs. (4.55) and (4.57), we get

$$ |\mathbf{w}^{*T}\mathbf{w}(m)| \ge m\eta\alpha, \quad \text{for all updates}. \qquad (4.58) $$

Similarly, using Eq. (4.36), we get

$$ \|\mathbf{w}(m+1)\|^2 \le \|\mathbf{w}(m)\|^2 + \eta^2\beta, \quad \text{for } \mathbf{a}(m) \in A_1, \qquad (4.59) $$

and

$$ \|\mathbf{w}(m+1)\|^2 \le \|\mathbf{w}(m)\|^2 + \eta^2\beta, \quad \text{for } \mathbf{a}(m) \in A_2. \qquad (4.60) $$

Starting from $\mathbf{w}(0) = \mathbf{0}$, we get for both (4.59) and (4.60)

$$ \|\mathbf{w}(m)\|^2 \le m\,\eta^2\beta. \qquad (4.61) $$

Therefore, from Eqs. (4.58) and (4.61), we get

$$ |\cos\theta| \ge \frac{m\eta\alpha}{\|\mathbf{w}^*\|\sqrt{m\eta^2\beta}} = \frac{\sqrt{m}\,\alpha}{\|\mathbf{w}^*\|\sqrt{\beta}}. \qquad (4.62) $$

Since $|\cos\theta| \le 1$, the number of steps m is bounded by $m \le \beta\|\mathbf{w}^*\|^2/\alpha^2$, and hence the weights converge in a finite number of steps.
Discussion on the convergence theorem: The number of iterations for convergence depends on the relation between $\mathbf{w}(0)$ and $\mathbf{w}^*$. Normally the initial value $\mathbf{w}(0)$ is set to $\mathbf{0}$. The initial setting of the weight values does not affect the proof of the perceptron convergence theorem.
The working of the perceptron learning can be viewed as follows: At the $(m+1)$th iteration we have

$$ \mathbf{w}(m+1) = \mathbf{w}(m) + \eta\,\mathbf{a}(m), \quad \text{for } \mathbf{w}^T(m)\,\mathbf{a}(m) \le 0 \text{ and } \mathbf{a}(m) \in A_1. $$

From this we get

$$ \mathbf{w}^T(m+1)\,\mathbf{a}(m) = \mathbf{w}^T(m)\,\mathbf{a}(m) + \eta\,\mathbf{a}^T(m)\,\mathbf{a}(m). $$

Notice that if $\mathbf{w}^T(m)\,\mathbf{a}(m) \le 0$, then $\mathbf{w}^T(m+1)\,\mathbf{a}(m) > 0$, provided $\eta$ is chosen such that

$$ \eta\,\mathbf{a}^T(m)\,\mathbf{a}(m) > -\,\mathbf{w}^T(m)\,\mathbf{a}(m). $$

Thus the given pattern is classified correctly if it is presented to the perceptron with the new weight vector $\mathbf{w}(m+1)$. The weight vector is adjusted to enable the pattern to be classified correctly.
The perceptron convergence theorem for the two class problem is applicable for both binary (0, 1) and bipolar (-1, +1) input and output data. By considering a two-class problem each time, the perceptron convergence theorem can be proved for a multiclass problem as well. The perceptron learning law and its proof of convergence are applicable for a single layer of nonlinear processing units, also called a single layer perceptron. Note that convergence takes place provided an optimal solution $\mathbf{w}^*$ exists. Such a solution exists for a single layer perceptron only if the given classification problem is linearly separable. In other words, the perceptron learning law converges to a solution only if the class boundaries are separable by linear hyperplanes in the M-dimensional input pattern space.
Perceptron learning as gradient descent: The perceptron learning law in Eq. (4.36) can also be written as

w(m + 1) = w(m) + η [b(m) − s(m)] a(m)    (4.68)

where b(m) is the desired output, which for the binary case is given by

b(m) = 1, for a(m) ∈ A1    (4.69)
b(m) = 0, for a(m) ∈ A2    (4.70)

and s(m) is the actual output of the perceptron for the input vector a(m). The actual output is given by

s(m) = 1, if w^T(m) a(m) > 0    (4.71)
s(m) = 0, if w^T(m) a(m) ≤ 0    (4.72)

From Eq. (4.68) we note that if b(m) = s(m), then w(m + 1) = w(m), and no correction takes place. On the other hand, if there is an error, then the update rule given by Eq. (4.68) is the same as the update rule given in Eq. (4.36).
Note that Eq. (4.68) is also valid for a bipolar output function, when b(m) = ±1 and s(m) = ±1. Therefore Eq. (4.68) can be written as

w(m + 1) = w(m) + η e(m) a(m)    (4.73)

where e(m) = b(m) − s(m) is the error signal. If we use the instantaneous correlation (product) between the output error e(m) and the activation value x(m) = w^T(m) a(m) as a measure of performance E(m), then

E(m) = − e(m) x(m) = − e(m) w^T(m) a(m)    (4.74)

The negative derivative of E(m) with respect to the weight vector w(m) can be defined as the negative gradient of E(m), and is given by

− ∂E(m)/∂w(m) = e(m) a(m)    (4.75)

Thus the weight update in the perceptron learning law in Eq. (4.73) is proportional to the negative gradient of the performance measure E(m).
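The gradient interpretation can be checked numerically: holding the error signal e fixed (as in the differentiation above), the negative gradient of E(w) = −e w^T a with respect to w is e a, exactly the direction of the perceptron update. The pattern, weights, and error value below are arbitrary illustrative assumptions:

```python
import numpy as np

# Performance measure E(w) = -e * w.a, with the error signal e held fixed
a = np.array([0.5, -1.0, 2.0])    # arbitrary input pattern (illustrative)
w = np.array([0.1, 0.4, -0.3])    # arbitrary current weight vector
e = 1.0                            # error signal b - s for a misclassified pattern

E = lambda w: -e * np.dot(w, a)

# Analytical negative gradient: -dE/dw = e * a  (the perceptron update direction)
neg_grad = e * a

# Central finite-difference estimate of dE/dw for comparison
h = 1e-6
fd = np.array([(E(w + h * dw) - E(w - h * dw)) / (2 * h) for dw in np.eye(3)])
print(neg_grad, -fd)    # the two directions agree
```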
Perceptron representation problem: Convergence in the perceptron learning takes place only if the pattern classes are linearly separable in the pattern space. Linear separability requires that the convex hulls of the pattern sets of the two classes are disjoint. A convex hull of a pattern set A is the smallest convex set that contains A. A convex set is a set of points in the M-dimensional space such that a line joining any two points in the set lies entirely in the region enclosed by the set. For linearly separable classes, the perceptron convergence theorem ensures that the final set of weights will be reached in a finite number of steps. These weights define a linear hyperplane separating the two classes. But in practice the number of linearly separable functions decreases rapidly as the dimension of the input pattern space is increased [Cameron, 1960; Muroga, 1971]. Table 4.3 shows the number of linearly separable functions for a two-class problem with binary input patterns for different dimensions of the input pattern space. For binary pattern classification problems with two inputs (M = 2), there are 14 functions which are linearly separable. The problem in Figure 4.5a is one of the linearly separable functions. There are two functions which are linearly inseparable, one of which is shown in Figure 4.5b. These linearly inseparable problems do not lead to convergence of weights through the perceptron learning law, indicating that these problems are not representable by a single layer perceptron discussed so far.
Table 4.3 Number of Linearly Separable Functions for a Two-class Problem

Dimension of     Number of             Number of linearly
input data M     possible functions    separable functions
1                4                     4
2                16                    14
3                256                   104
4                65536                 1882
5                ~4.3 x 10^9           94572
6                ~1.8 x 10^19          15028134
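The M = 2 row of Table 4.3 can be verified by brute force: enumerate all 16 Boolean functions of two binary inputs and test each one against a small grid of weights and thresholds. The grid below is an illustrative choice (small integer weights and half-integer thresholds suffice for two binary inputs):

```python
from itertools import product

inputs = list(product([0, 1], repeat=2))   # the four binary input patterns

def separable(truth):
    """True if some threshold unit w1*a1 + w2*a2 > theta reproduces the truth table."""
    for w1, w2 in product(range(-2, 3), repeat=2):
        for theta in (-2.5, -1.5, -0.5, 0.5, 1.5, 2.5):
            outs = tuple(int(w1 * a1 + w2 * a2 > theta) for a1, a2 in inputs)
            if outs == truth:
                return True
    return False

# All 2^4 = 16 Boolean functions of two variables, as output truth tables
count = sum(separable(t) for t in product([0, 1], repeat=4))
print(count)   # 14: only XOR and XNOR are not linearly separable
```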
Figure 4.5 Examples of (a) linearly separable and (b) linearly inseparable classification problems. The classes are indicated by 'x' and 'o'.
4.3.2 Linear Inseparability: Hard Problems
A two-layer feedforward network with hard-limiting threshold units
in the output layer can solve linearly separable pattern classification
problems. This is also called a single layer perceptron, as there is
only one layer of nonlinear units. There are many problems which
are not linearly separable, and hence are not representable by a single
layer perceptron. These unrepresentable problems are called hard
problems. Some of these problems are illustrated using the perceptron
model consisting of sensory units, association units and the output
layer as shown in Figure 4.6. The output unit of the perceptron
computes a logical predicate, based on the information fed to it by
the association units connected to it. The association units form a
family of local predicates, computing a set of local properties or
features. The family of local predicates looks at a 'retina', which consists of points on a plane and, in the figure, corresponds to the sensory input. In the simple 'linear' perceptron the output unit looks
at the local predicates from the association units, takes their weighted
sum, compares with a threshold, and then responds with a value for
the overall logical predicate, which is either true or false. If it were
Figure 4.6 Rosenblatt's perceptron model of a neuron: sensory units feeding association units, whose outputs are weighted (adjustable weights) and combined by a summing output unit.
possible for the local predicates to look at every point in the entire retina, any possible logical predicate could be computed. This would be impractical, since the number of possible patterns on the retina grows exponentially with the size of the retina [Aleksander and Morton, 1990]. Two important limitations on the local predicates are: order-limited, where only a certain maximum order of retinal points could be connected to the local decision unit computing the local predicate, and diameter-limited, where only a geometrically restricted
region of the retina could be connected to the local predicate. The order-limited perceptron cannot compute the parity problem examples shown in Figure 4.7, where images with an even number of distinct unconnected patterns (Figure 4.7a) should produce an output 1, and the output for the odd number of patterns in Figure 4.7b should be 0. Likewise, the diameter-limited perceptron cannot handle the connectedness problem examples shown in Figure 4.8, where to detect connectedness the output of the perceptron should be 1 if there is only one connected pattern as in Figure 4.8a, and 0 otherwise, as for the patterns shown in Figure 4.8b [Aleksander and Morton, 1990].

Figure 4.7 A parity problem illustrating the order-limited perceptron: (a) Even parity and (b) Odd parity.

Figure 4.8 A connectedness problem illustrating the diameter-limited perceptron: (a) Connected class and (b) Disconnected class.
4.3.3 Geometrical Interpretation of Hard Problems: Multilayer Perceptron
In this section the problem of pattern classification and the performance of feedforward neural networks are discussed in geometric terms. A pattern classification problem can be viewed as determining the hypersurfaces separating the multidimensional patterns belonging to different classes. For convenience, throughout this section we consider a two-dimensional pattern space. If the pattern classes are linearly separable, then the hypersurfaces reduce to straight lines, as shown in Figure 4.9. A two-layer network consisting of two input units and N output units can produce N distinct lines in the pattern space. These lines can be used to separate different classes, provided the regions formed by the pattern classification problem are linearly separable. As mentioned earlier (see Table 4.3), linearly separable problems are in general far fewer among all possible problems, especially as the dimensionality of the input space increases. If the outputs of the second layer are combined by a set of units forming another layer, then it can be shown that any convex region can be formed by the separating surfaces [Lippmann, 1987]. A convex region is one in which a line joining any
Figure 4.9 An example of linearly separable classes: (a) Network, with an input layer and an output layer of hard-limiting units, and (b) Linearly separable classes.
two points is entirely confined to the region itself. Figure 4.10 illustrates the regions that can be created by a three layer network. In this case the number of units in the second layer determines the shape of the dividing surface, and the number of units in the third layer decides the number of classes. It can be seen that the three-layer network (Figure 4.10) is not general enough, as it is not guaranteed that the class regions in the pattern space form convex regions in all cases. In fact, one could have a situation as shown for the classes with meshed regions, where the desired classes are enclosed by complicated nonconvex regions. Note that intersection of linear hyperplanes in the three layer network can only produce convex surfaces. However, intersection of the convex regions may produce any nonconvex region also. Thus adding one more layer of units to combine the outputs of the third layer can yield surfaces which can separate even the nonconvex regions of the type shown in Figure 4.10. In fact, it can be shown that a four-layer network, with the input layer consisting of linear units and the other three layers consisting of hard-limiting nonlinear units, can perform any complex pattern classification task. Thus all the hard problems mentioned earlier can be handled by a multilayer feedforward neural network with nonlinear units. Such a network is also called a multilayer perceptron. Note that the two-layer network in Figure 4.9 is a single-layer perceptron, and the three-layer and four-layer networks in Figure 4.10 are two-layer and three-layer perceptrons, respectively.
Figure 4.10 Geometrical interpretation of pattern classification. The figure shows decision regions for different layers of perceptron networks. [Adapted from Lippmann, 1987]
The above discussion is focussed primarily on a multilayer perceptron network with units having hard-limiting nonlinear output functions. Similar behaviour is expected from a multilayer feedforward neural network when the output functions of the units are continuous nonlinear functions, such as sigmoid functions. In these cases the decision regions are typically bounded by smooth surfaces instead of linear hyperplanes, and hence geometrical visualization and interpretation is difficult.
Table 4.4 gives a summary of the discussion on the perceptron
network. The main difficulty with a multilayer perceptron network
is that it is not straightforward to adjust the weights leading to the
units in the intermediate layers, since the desired output values of
the units in these layers are not known. The perceptron learning uses
the knowledge of the error between the desired output and the actual
output to adjust the weights. From the given data only the desired
Table 4.4 Summary of Issues in Perceptron Learning

1. Perceptron network
   Weighted sum of the input to a unit with hard-limiting output function.

2. Perceptron classification problem
   For a two class (A1 and A2) problem, determine the weights (w) and threshold (θ) such that
   w^T a − θ > 0, for a ∈ A1, and
   w^T a − θ ≤ 0, for a ∈ A2

3. Perceptron learning law
   The weights are determined in an iterative manner using the following learning law at the (m + 1)th iteration:
   w(m + 1) = w(m) + η a(m), if w^T(m) a(m) ≤ θ and a(m) ∈ A1
   w(m + 1) = w(m) − η a(m), if w^T(m) a(m) > θ and a(m) ∈ A2
   where η is a (positive) learning rate parameter.

4. Perceptron learning as gradient descent
   The perceptron learning law can be rewritten as a single equation:
   w(m + 1) = w(m) + η e(m) a(m)
   where e(m) = b(m) − s(m) is the error signal. Denoting the performance measure E(m) = − e(m) w^T(m) a(m), we have
   Δw(m) = − η ∂E(m)/∂w(m)

5. Perceptron convergence theorem
   The perceptron learning law converges in a finite number of steps, provided that the given classification problem is representable.

6. Perceptron representation problem
   A classification problem is representable by a single layer perceptron if the classes are linearly separable, i.e., separable by linear hyperplanes in the input feature space. Classification problems that are not linearly separable are called hard problems.

7. Multilayer perceptron
   Any pattern classification problem, including the hard problems, can be represented by a multilayer perceptron network.
output values of the units in the final output layer are known. Thus, although a multilayer perceptron network can handle hard problems, the problem of learning or training such a network, called the hard learning problem, remains. This problem is discussed in detail in the following section.
4.4 Analysis of Pattern Mapping Networks

4.4.1 Pattern Mapping Problem
If a set of input-output pattern pairs is given corresponding to an
arbitrary function transforming a point in the M-dimensional input
pattern space to a point in the N-dimensional output pattern space,
then the problem of capturing the implied functional relationship is
called a mapping problem. The network that accomplishes this task
is a mapping network. Since no restriction such as linear
separability is placed on the set of input-output pattern pairs, the
pattern mapping problem is a more general case of pattern
classification problem.
Note that the objective in the mapping problem is to
capture the implied function, the generalization implied in the
given input-output pattern pairs. This can also be viewed as an
approximation of the function from the given data. For a complex system with multiple inputs and multiple outputs, if several input-output pattern pairs are collected during experimentation, then the
objective in the mapping problem is to capture the system
characteristics from the observed data. For a new input, the captured
system is expected to produce an output close to the one that would
have been obtained by the real system. In terms of function
approximation, the approximate mapping system should give an
output which is close to the values of the real function for inputs
close to the current input used during learning. Note that the
approximate system does not produce strictly an interpolated output,
as the function finally captured during learning may not fit any of
the points given in the training set. This is illustrated in Figure 4.11.
Figure 4.11 Function approximation in a pattern mapping problem.
4.4.2 Pattern Mapping Network
Earlier we have seen that a multilayer feedforward neural network
with at least two intermediate layers in addition to the input and
output layers can perform any pattern classification task. Such a
network can also perform a pattern mapping task. The additional
layers are called hidden layers, and the number of units in the hidden
layers depends on the nature of the mapping problem. For any
arbitrary problem, generalization may be difficult. With a sufficiently
large size of the network, it is possible to (virtually) store all the
input-output pattern pairs given in the training set. Then the network
will not be performing the desired mapping, because it will not be
capturing the implied functional relationship between the given
input-output pattern pairs.
Except in the input layer, the units in the other layers must be nonlinear in order to provide generalization capability for the network. In fact, it can be shown that, if all the units are linear, then a multilayer network can be reduced to an equivalent two-layer network with a set of N x M weights.
Let W1, W2 and W3 be the weight matrices of appropriate sizes between the input layer and the first hidden layer, the first hidden layer and the second hidden layer, and the second hidden layer and the output layer, respectively. Then, if all the units are linear, the output and input patterns are related by a single equivalent weight matrix W containing N x M weight elements. That is,

b = W3 W2 W1 a = W a    (4.76)
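The reduction in Eq. (4.76) is easy to verify numerically: a three-stage linear network applied layer by layer gives the same output as the single collapsed matrix W = W3 W2 W1. The layer sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
M, H1, H2, N = 4, 5, 3, 2              # input, two hidden, and output layer sizes (illustrative)

W1 = rng.standard_normal((H1, M))      # input layer -> first hidden layer
W2 = rng.standard_normal((H2, H1))     # first hidden -> second hidden layer
W3 = rng.standard_normal((N, H2))      # second hidden -> output layer

a = rng.standard_normal(M)             # an input pattern

# Layer-by-layer propagation with linear units
b_layered = W3 @ (W2 @ (W1 @ a))

# Equivalent two-layer network with a single N x M weight matrix
W = W3 @ W2 @ W1
b_collapsed = W @ a
print(np.allclose(b_layered, b_collapsed))   # True
```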
As can be seen easily, such a network reduces to a linear associative
network. But if the units in the output layer are nonlinear, then the
network is limited by the linear separability constraint on the
function relating the input-output pattern pairs. If the units in the hidden layers and in the output layer are nonlinear, then the number of unknown weights depends on the number of units in the hidden layers, besides the number of units in the input and output layers.
The pattern mapping problem involves determining these weights,
given a training set consisting of input-output pattern pairs. We need
a systematic way of updating these weights when each input-output
pattern pair is presented to the network. In order to do this updating
of weights in a supervisory mode, it is necessary to know the desired
output for each unit in the hidden and output layers. Once the desired
output is known, the error, the difference between the desired
and actual outputs from each unit may be used to guide the updating
of the weights leading to the unit from the units in the previous layer.
We know the desired output only for the units in the final output
layer, and not for the units in the hidden layers. Therefore a
straightforward application of a learning rule, that depends on the
difference between the desired and the actual outputs, is not feasible
in this case. The problem of updating the weights in this case is called
a hard learning problem.
The hard learning problem is solved by using a differentiable
nonlinear output function for each unit in the hidden and output
layers. The corresponding learning law is based on propagating the
error from the output layer to the hidden layers for updating the
weights. This is an error correcting learning law, also called the generalized delta rule. It is based on the principle of gradient descent along the error surface.
Appendix C gives the background information needed for under-
standing the gradient descent methods. Table 4.5 gives a summary
of the gradient search methods discussed in the Appendix C. In the
following section we derive the generalized delta rule applicable for
a feedforward network with nonlinear units.
Table 4.5 Summary of Basic Gradient Search Methods

1. Objective
   Determine the optimal set of weights for which the expected error between the desired and actual outputs is minimum.
   For a linear network the error surface is a quadratic function of the weights. The gradient of the error at the weight vector w is given by
   ∇ = 2 R (w − w*)
   where w* is the optimal weight vector and R is the autocorrelation matrix of the input data.

2. Gradient search methods
   We can write the equation for adjustment of weights as
   w(m + 1) = w(m) − η R^{-1} ∇(m)
   If R and ∇ are known exactly, then with η = 1/2 the above adjustment reaches the optimal weights in one step, starting from any initial weights.
   If R and ∇ are known only approximately, then the optimum weight vector can be obtained in an iterative manner, where 0 < η < 1 for convergence. This is Newton's method. The error moves approximately along the path from w(m) to w*. Here η is a dimensionless quantity.
   If the weights are adjusted in the direction of the negative gradient at each step, it becomes the method of steepest descent:
   w(m + 1) = w(m) + η (− ∇(m))
   where 0 < η < 1/λ_max for convergence, and λ_max is the largest eigenvalue of R. The learning rate parameter has the dimensions of inverse of signal power. Here convergence is slower than in Newton's method.
   In general, the gradient cannot be computed, but can only be estimated. Hence convergence of the gradient descent methods is not guaranteed. The estimate depends on our knowledge of the error surface.
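The steepest-descent stability condition η < 1/λ_max can be illustrated on a quadratic error surface e(w) = (w − w*)^T R (w − w*), whose gradient is 2R(w − w*). The matrix, optimum, and starting point below are arbitrary illustrative assumptions:

```python
import numpy as np

R = np.array([[2.0, 0.5],
              [0.5, 1.0]])              # autocorrelation matrix (illustrative)
w_star = np.array([1.0, -2.0])          # optimal weight vector (illustrative)
lam_max = np.linalg.eigvalsh(R).max()   # largest eigenvalue of R

eta = 0.9 / lam_max                     # learning rate below the stability bound 1/lam_max
w = np.zeros(2)
for _ in range(500):
    grad = 2.0 * R @ (w - w_star)       # gradient of (w - w*)^T R (w - w*)
    w = w - eta * grad                  # steepest descent step
print(w)                                # converges to w_star
```

With η above 1/λ_max the error component along the largest-eigenvalue direction would grow at each step instead of shrinking, so the iteration would diverge.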
Table 4.5 Summary of Basic Gradient Search Methods (Cont.)

3. Nature of error surface
   Error surface may not be quadratic if R is to be estimated from a small set of samples.
   Error surface is not quadratic for instantaneous measurement of error.
   Error surface is also not quadratic if the units are nonlinear.
   Error surface is not predictable for nonstationary input data, since R will be varying with time.

4. Estimation of gradient
   Derivative measurement: Uses general knowledge of the error surface.
   Instantaneous measurement (linear units): Uses specific knowledge of the error surface. LMS algorithm. Leads to convergence in the mean (stochastic gradient descent).
   Instantaneous measurement (nonlinear units): Delta rule. No guarantee of convergence, even in the mean as in the LMS algorithm.
4.4.3 Generalized Delta Rule: Backpropagation Learning
The objective is to develop a learning algorithm for a
feedforward neural network, so that the network can be trained to
capture the mapping implicit in the given set of input-output pattern
pairs. The approach to be followed is basically a gradient descent
along the error surface to arrive at the optimum set of weights. The
error is defined as the squared difference between the desired output
given output pattern) and the actual output obtained at the
output layer of the network due to application of an input pattern
from the given input-output pattern pair. The output is calculated
using the current setting of the weights in all the layers. The optimum
weights may be obtained if the weights are adjusted in such a way
that the gradient descent is made along the total error surface. But
The desirable characteristic of any learning law is to specify the
incremental update of the weights of the network for each
presentation of an input-output pattern pair. While this may result
in a suboptimal solution, in most cases of practical significance the
result is acceptable.
A learning law, called the generalized delta rule or backpropagation law, is derived in this section [Werbos, 1974; Rumelhart et al., 1986]. Let (a(l), b(l)), l = 1, 2, ..., L, be the set of training pattern pairs. It is not necessary to have all the training data at one time, nor need the training data set be a finite set. The objective is to determine the weight update for each presentation of an input-output pattern pair.
Since the given data may be used several times during training, let
us use the index m to indicate the presentation step for the training
pair at step m.
For training a multilayer feedforward neural network, we use the following estimate of the gradient descent along the error surface to determine the increment in the weight connecting the units j and i:

Δw_ij = − η ∂E/∂w_ij

where η > 0 is a learning rate parameter, which may also vary for each presentation of the training pair. The weight update is given by

w_ij(m + 1) = w_ij(m) + Δw_ij(m)

The generalized delta rule to be derived below consists of deriving expressions for Δw_ij for the connections at different layers.
Let us consider the multilayer neural network given in Figure 4.12. The network consists of three layers of units: the first layer has I linear input units indexed by i, the second layer has J nonlinear units indexed by j, and the third layer has K nonlinear units indexed by k. For simplicity, only one layer (the second layer) of hidden units is considered here. Extension of the learning law to a network consisting of several hidden layers is trivial.

Figure 4.12 A three layer neural network.
Since the input vector a(m) is given at the input layer and the desired output b(m) is available only at the output layer, the error between the desired output vector b(m) and the actual output vector b'(m) is available only at the output layer. Using this error, it is necessary to adjust the weights from the input units to the hidden units, and the weights from the hidden units to the output units.
Let (a(m), b(m)) be the current sample of the function mapping the input space to the output space. Let b'(m) be the actual output of the network for the input a(m) at the step m. The mean squared error at the mth step is given by

E(m) = (1/2) Σ_{k=1}^{K} [b_k(m) − s_k^o(m)]^2

where the superscript o refers to the output units quantities, the superscript h refers to the hidden units quantities, and a_i, x_i and s_i refer to the input, activation and output values for the unit i, respectively. For the weights leading to the units in the output layer:

Δw_kj = − η ∂E(m)/∂w_kj = η [b_k − s_k^o] f'(x_k^o) s_j^h = η δ_k^o s_j^h    (4.89)

where δ_k^o = [b_k − s_k^o] f'(x_k^o). Here the iteration index m is omitted in the functions and variables on the right hand side for convenience. Therefore the weights leading to the output layer are updated as

w_kj(m + 1) = w_kj(m) + η δ_k^o s_j^h

For the weights leading to the units in the hidden layer:

Δw_ji = − η ∂E(m)/∂w_ji = η Σ_{k=1}^{K} [b_k − s_k^o] f'(x_k^o) ∂x_k^o/∂w_ji

Since x_k^o = Σ_{j=1}^{J} w_kj s_j^h, we get

∂x_k^o/∂w_ji = w_kj ∂s_j^h/∂w_ji = w_kj f'(x_j^h) ∂x_j^h/∂w_ji

and since x_j^h = Σ_{i=1}^{I} w_ji a_i, we get ∂x_j^h/∂w_ji = a_i. Therefore

Δw_ji = η f'(x_j^h) [Σ_{k=1}^{K} δ_k^o w_kj] a_i = η δ_j^h a_i

where

δ_j^h = f'(x_j^h) Σ_{k=1}^{K} δ_k^o w_kj

Hence the weights leading to the hidden layer are updated as

w_ji(m + 1) = w_ji(m) + η δ_j^h a_i

The term Σ_{k=1}^{K} δ_k^o w_kj represents the error propagated back to the output of the hidden units from the next layer, hence the name backpropagation for this learning algorithm. Table 4.6 gives a summary of the backpropagation learning algorithm.
4.4.4 Discussion on Backpropagation Law

There are several issues which are important for understanding and implementing the backpropagation learning in practice [Haykin, 1994; Russo, 1991; Hush and Horne, 1993; Werbos, 1994]. A summary of the issues is given in Table 4.7. A few of these issues will be discussed in this section.
Table 4.6 Backpropagation Algorithm (Generalized Delta Rule)
Given a set of input-output patterns (a(l), b(l)), l = 1, 2, ..., L, where the lth input vector is a(l) = (a_1(l), ..., a_I(l))^T and the lth output vector is b(l) = (b_1(l), ..., b_K(l))^T.
Assume only one hidden layer and the initial setting of weights to be arbitrary. Assume an input layer with only linear units; the output signal is then equal to the input activation value for each of these units. Let η be the learning rate parameter. For a given pattern pair, let a = a(l) and b = b(l).

Activation of unit i in the input layer:  x_i = a_i
Output signal of unit i in the input layer:  s_i = x_i
Activation of unit j in the hidden layer:  x_j^h = Σ_{i=1}^{I} w_ji s_i
Output signal from unit j in the hidden layer:  s_j^h = f_j(x_j^h)
Activation of unit k in the output layer:  x_k^o = Σ_{j=1}^{J} w_kj s_j^h
Output signal from unit k in the output layer:  s_k^o = f_k(x_k^o)
Error term for the kth output unit:  δ_k^o = (b_k − s_k^o) f'_k(x_k^o)
Update the weights on the output layer:  w_kj(m + 1) = w_kj(m) + η δ_k^o s_j^h
Error term for the jth hidden unit:  δ_j^h = f'_j(x_j^h) Σ_{k=1}^{K} δ_k^o w_kj
Update the weights on the hidden layer:  w_ji(m + 1) = w_ji(m) + η δ_j^h a_i
Calculate the error for the lth pattern:  E_l = (1/2) Σ_{k=1}^{K} (b_k − s_k^o)^2
Total error for all patterns:  E = Σ_{l=1}^{L} E_l

Apply the given patterns one by one, possibly several times, in some random order, and update the weights until the total error reduces to an acceptable value.
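The steps in Table 4.6 can be sketched directly in code with one hidden layer of logistic units. The network size, learning rate, number of cycles, and the use of XOR (a hard problem) as the training set are illustrative assumptions; thresholds are folded in as bias terms:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 1.0 / (1.0 + np.exp(-x))             # logistic output function

# XOR: a linearly inseparable (hard) problem, used here for illustration
A = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # input patterns (I = 2)
B = np.array([[0.], [1.], [1.], [0.]])                  # desired outputs (K = 1)

J = 4                                                   # hidden units
Wh = rng.standard_normal((J, 2)); bh = np.zeros(J)      # hidden layer weights, biases
Wo = rng.standard_normal((1, J)); bo = np.zeros(1)      # output layer weights, biases
eta = 0.5

def total_error():
    s_h = f(A @ Wh.T + bh)
    s_o = f(s_h @ Wo.T + bo)
    return 0.5 * np.sum((B - s_o) ** 2)

e0 = total_error()
for _ in range(5000):                                   # several cycles through the patterns
    for a, b in zip(A, B):
        s_h = f(Wh @ a + bh)                            # forward pass: hidden output signals
        s_o = f(Wo @ s_h + bo)                          # forward pass: output signals
        d_o = (b - s_o) * s_o * (1 - s_o)               # delta for output units
        d_h = (Wo.T @ d_o) * s_h * (1 - s_h)            # error propagated back to hidden units
        Wo += eta * np.outer(d_o, s_h); bo += eta * d_o # generalized delta rule updates
        Wh += eta * np.outer(d_h, a);   bh += eta * d_h
print(e0, total_error())                                # total error decreases with training
```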
Table 4.7 Issues in Backpropagation Learning
Description and features of backpropagation
Significance of error backpropagation
Forward computation (inner product and nonlinear output function)
Backward operation (error calculation and derivative of output function)
Nature of output function (semilinear)
Stochastic gradient descent
Local computations
Stopping criterion
Performance of backpropagation learning
Initialization of weights
Presentation of training patterns: Pattern and batch
Learning rate
- Range and value of η for stability and convergence
- Learning rate adaptation for better convergence
Momentum term for faster convergence
Second order methods for better and faster convergence
Refinement of learning
Stochastic gradient descent, not an optimization
Nonlinear system identification: Extended Kalman-type algorithm
Unconstrained optimization: Conjugate-gradient methods
Asymptotic estimation of a posteriori class probabilities
Fuzzy backpropagation learning
Interpretation of the results of learning
Ill-posed nature of solution
Approximation of functions
Good estimation of decision surfaces
Nonlinear feature detector followed by linearly separable classification
Estimation of a posteriori class probabilities
Generalization
VC dimension
Cross-validation
Loading problem
Size and efficiency of training set data
Architectures of network
Complexity of problem
Tasks with backpropagation network
Logic function
Pattern classification
Pattern mapping
Function approximation
Probability estimation
Prediction
Limitations of backpropagation learning
Slow convergence (no proof of convergence)
Local minima problem
Scaling
Need for teacher: Supervised learning
Extensions of backpropagation
Learning with critic
Regularization
Radial basis functions
Probabilistic neural networks
Fuzzy neural networks
Description and features of backpropagation: The training patterns
are applied in some random order one by one, and the weights are
adjusted using the backpropagation learning law. Each application of
the training set patterns is called a cycle. The patterns may have to
be applied for several training cycles to obtain the output error to an
acceptable low value. Once the network is trained, it can be used to
recall the appropriate pattern (in this case some interpolated output
pattern) for a new input pattern. The computation for recall is
straightforward, in the sense that the weights and the output
functions of the units in different layers are used to compute the
activation values and the output signals. The signals from the output
layer correspond to the output.
Backpropagation learning emerged as the most significant result in the field of artificial neural networks. In fact, it is this learning law that led to the resurgence of interest in neural networks, after a nearly 15 year period of lull due to the exposition of the limitations of perceptron learning by Minsky and Papert [1969]. In this section we will discuss various features, including limitations, of the backpropagation learning. We will also discuss the issues that determine the performance of the network resulting from the learning law. We will
discuss these issues with reference to specific applications, and also
with reference to some potential applications of the multilayer
feedforward neural networks.
As noted earlier, the backpropagation learning involves propagation of the error backwards from the output layer to the hidden layers in order to determine the update for the weights leading
to the units in a hidden layer. The error at the output layer itself is
computed using the difference between the desired output and the
actual output at each of the output units. The actual output for a
given input training pattern is determined by computing the outputs
of units for each hidden layer in the forward pass of the input data.
Note that the error in the output is propagated backwards only to
determine the weight updates. There is no feedback of the signal
itself at any stage, as it is a feedforward neural network.
Since the backpropagation learning is exactly the same as the
delta learning (see Section 1.6.3) at the output layer and is similar
to the delta learning with the propagated error at the hidden layers,
it is also called generalized delta rule. The term 'generalized' is used
because the delta learning could be extended to the hidden layer
units. Backpropagation of error is possible only if the output functions
of the nonlinear processing units are differentiable. Note that if these
output functions are linear, then we cannot realize the advantage of
a multilayer network to generate complex decision boundaries for nonlinearly separable (hard) classification problems. In fact, a multilayer feedforward network with linear processing units is equivalent to a linear associative network, as discussed in Eq. (4.76), which, in turn, is limited to solving simple pattern association problems. On
the other hand, a hard-limiting output function, as in a multilayer perceptron, cannot be used for learning the weights. A common
differentiable output function used in the backpropagation learning
is one which possesses a sigmoid nonlinearity. Two examples of
sigmoidal nonlinear functions are the logistic function and the hyperbolic tangent function (see Figure 4.13):

Logistic function:

f(x) = 1 / (1 + e^{−x})    (4.100)

Hyperbolic tangent function:

f(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (4.101)

For the logistic function the limits are 0 ≤ f(x) ≤ 1, and for the hyperbolic tangent function the limits are −1 ≤ f(x) ≤ 1.

Figure 4.13 (a) Logistic function and its derivative f'(x) = f(x)[1 − f(x)], and (b) hyperbolic tangent function and its derivative f'(x) = 1 − f^2(x).
Let us consider the derivative of the logistic function:

f'(x) = f(x) [1 − f(x)]    (4.102)

It can be seen from Eq. (4.102) that f'(x) has the maximum value of 0.25 when f(x) = 0.5, and has the minimum value of 0 when f(x) = 0 or 1. Since the amount of change in the weight value leading to any unit i in the network is proportional to f'(x), the change is maximum in the midrange of the activation value. This feature of the learning law contributes to its stability [Rumelhart et al., 1986].
Note that the hyperbolic tangent function can be viewed as a biased and scaled version of the logistic function. That is,

tanh(x) = 2 f(2x) − 1

The asymmetry of the hyperbolic tangent function seems to make the learning faster by reducing the number of iterations required for training [1991].
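Both properties are easy to check numerically: the logistic derivative f(x)[1 − f(x)] peaks at 0.25 where f(x) = 0.5 (i.e., at x = 0), and tanh(x) = 2f(2x) − 1 holds identically:

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))        # logistic function

x = np.linspace(-6, 6, 10001)                 # grid symmetric about 0
deriv = f(x) * (1 - f(x))                     # derivative of the logistic function

print(deriv.max())                            # 0.25, attained at x = 0 where f(x) = 0.5
print(np.allclose(np.tanh(x), 2 * f(2 * x) - 1))   # True: tanh as biased, scaled logistic
```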
The backpropagation learning is based on the gradient descent
along the error surface. That is, the weight adjustment is proportional
to the negative gradient of the error with respect to the weight. The
error is the instantaneous error between the desired and the actual
values of the output of the network. This instantaneous error is due
to a given training pattern, which can be assumed to be a sample
function of a random process. Thus the error can be assumed to be
a random variable. Therefore this gradient descent method is a
stochastic gradient learning method. Due to this stochastic nature,
the path to the minimum of the error surface will be zigzag. The
error surface itself will only be an approximation to the true error surface determined by the entire training set of patterns. Moreover, even the
true error surface is not a smooth quadratic surface as in the case
of the Adaline. In fact the error surface may contain several local
minima besides the global minimum. Hence the stochastic approxima-
tion of the gradient descent used in the backpropagation learning
need not converge. There is no proof of convergence even in the mean
as in the case of the LMS algorithm. The issues in the convergence
of gradient descent methods are summarized in Table 4.8.
Since there is no proof of convergence, some heuristic criteria are
used to stop the process of learning. They are based on the values of
the gradient and the error in successive iterations and also on the
total number of iterations. The average gradient value over each
training cycle (presentation of all the training patterns once) is
observed, and if this average value is below a preset threshold value
for successive cycles, then the training process may be stopped.
Likewise, the training process may be stopped using a threshold for
the average error and observing the average error in successive cycles.
Feedforward Neural Networks
Table 4.8 Gradient Descent and Convergence
1. Let the input-output vector pair (a, b) be the sample function of a random
process. The ensemble average of the error is
E = E[e]
where e is the instantaneous error for a given sample function. For
linear units the error surface is a smooth bowl-shape in the weight
space, and hence the gradient descent converges to the optimal
weight vector w*.
2. Estimation of the error from a finite set of M input-output pairs:
E = (1/M) Σ_{m=1}^{M} e_m
For linear units, this error surface is an approximation to the bowl-shape
in the weight space and hence convergence of the gradient descent is only
approximate.
3. Instantaneous error (linear units): E = e
For linear units, the gradient descent converges only in the mean
(stochastic convergence).
4. Instantaneous error (nonlinear units): E = e
For nonlinear units, there is no proof of convergence even in the stochastic
sense.
Sometimes both the average gradient as well as the average error
may be used in the stopping criterion. But the main objective is to
capture the implicit pattern behaviour in the training set data so that
adequate generalization takes place in the network. The
generalization feature is verified by testing the performance of the
network for several new (test) patterns.
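The stopping heuristics described above can be sketched in code; the function name, threshold, and patience values below are illustrative, not from the text:

```python
def should_stop(avg_errors, threshold=1e-3, patience=3):
    """Heuristic stopping rule: stop when the per-cycle average error
    has stayed below `threshold` for `patience` successive cycles."""
    if len(avg_errors) < patience:
        return False
    return all(e < threshold for e in avg_errors[-patience:])

# Error still falling but not yet small enough: keep training.
assert not should_stop([0.5, 0.1, 0.01])
# Error below threshold for three successive cycles: stop.
assert should_stop([0.5, 0.0009, 0.0008, 0.0007])
```

The same rule can be applied to the average gradient magnitude per cycle, or to both quantities jointly.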
Performance of the backpropagation learning law: The performance
of the backpropagation learning law depends on the initial setting of
the weights, learning rate parameter, output functions of the units,
presentation of the training data, besides the specific pattern
recognition task (like classification, mapping, etc.) or specific
application (like function approximation, probability estimation,
prediction, logic function, etc.). It is important to initialize the weight
values properly before applying the learning law for a given training
set [Hush et al., 1991; Lee et al., 1991]. Initial weights correspond to
a priori knowledge. If we have the knowledge and also if we know
how to present the knowledge in the form of initial weights, then the
overall performance of the resulting trained network in terms of speed
of learning and generalization would improve significantly. In general
it is not known how to collect the relevant knowledge a priori. The
more difficult part is to know how to include it in the form of weights.
Therefore all the weights in the network are initialized to random
numbers that are distributed in a small range of values.
The range is typically [-a/√N_i, +a/√N_i], where N_i is the number of
inputs to the ith unit. Thus the range can be different for each unit.
The value of a is typically in the range (1 to 3) [..., 1992].
Initial weights that are in a very small range will result in long
learning times. On the other hand, large initial weight values may
result in the network output values in the saturation region of the
output function. In the saturation region the gradient value is small.
If the saturation is at the incorrect level, it may result in slow
learning due to small changes made in the weights in each iteration.
Incorrect saturation rarely occurs if the unit operates in the linear
range of the output function.
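A sketch of this initialization scheme; the function name is illustrative, and the default a = 3 is one choice from the range quoted above:

```python
import random

def init_weights(fan_in, a=3.0):
    """Uniform initial weights in [-a/sqrt(N_i), +a/sqrt(N_i)],
    where N_i (= fan_in) is the number of inputs to unit i, so that
    units with many inputs get a proportionally smaller range."""
    bound = a / fan_in ** 0.5
    return [random.uniform(-bound, bound) for _ in range(fan_in)]

w = init_weights(fan_in=100)
assert len(w) == 100
assert all(abs(wi) <= 0.3 for wi in w)   # bound = 3 / sqrt(100)
```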
Adjustment of the weights using backpropagation learning law is
done by presenting the given set of training patterns several times.
Randomizing the presentation of these patterns tends to make the
search in the weight space stochastic, and thus reduces the possibility
of 'limit cycles' in the trajectory in the weight space during learning
[..., 1994]. Presentation of the training data pattern by
pattern for adjustment of the weights makes it possible to have the
learning online. This pattern mode also reduces the problem of local
minima. But to speed up the learning process it is preferable to
update the weights in a batch mode, in which the gradient of the
error, computed over all the training patterns, is used. The batch
mode gives a better estimation of the gradient of the overall error
surface.
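The two modes can be contrasted on a single linear unit; the names and the toy data below are illustrative:

```python
def grad(w, pattern):
    """Gradient of the instantaneous squared error of one linear unit:
    e = 0.5 (b - w a)^2, so de/dw = -(b - w a) a."""
    a, b = pattern
    return -(b - w * a) * a

def pattern_mode_epoch(w, data, eta):
    """Pattern (online) mode: one weight update after every pattern."""
    for p in data:
        w -= eta * grad(w, p)
    return w

def batch_mode_epoch(w, data, eta):
    """Batch mode: one update per epoch, using the gradient of the
    error summed over all the training patterns."""
    return w - eta * sum(grad(w, p) for p in data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # consistent with w* = 2
wp = wb = 0.0
for _ in range(200):
    wp = pattern_mode_epoch(wp, data, eta=0.05)
    wb = batch_mode_epoch(wb, data, eta=0.05)
assert abs(wp - 2.0) < 1e-3 and abs(wb - 2.0) < 1e-3
```

Both modes converge here because the data are noise-free and consistent; with noisy data the batch gradient is the smoother estimate, as noted above.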
Learning rate parameter η plays a crucial role in the
backpropagation learning. The order of values for η depends on the
variance of the input data. For the case of the Adaline, the learning rate
parameter should satisfy η < 2/λ_max, where λ_max is the largest eigenvalue of the
autocorrelation matrix of the input data. This gives an indication for
the choice of η, since the derivation in the backpropagation does not
give any clue for this choice. Since it is a stochastic gradient
descent learning, too small an η will result in a smooth trajectory in
the weight space, but takes a long time to converge. On the other hand,
too large an η may increase the speed of learning, but will result in
large random fluctuations in the weight space, which in turn may
lead to an unstable situation in the sense that the network weights
may not converge.
It is desirable to adjust the weights in such a way that all the
units learn nearly at the same rate. That is, the net change in all
the weights leading to a unit should be nearly the same. To
accomplish this, the learning rate parameters should be different for
different weights. The weights leading to a unit with many inputs
should have a smaller η compared to the η for the weights leading to
a unit with fewer inputs. Also, the gradient of the error with respect
to the weights leading to the output layer will be larger than the
gradient of the error with respect to the weights leading to the hidden
layers. Therefore the learning rate parameters should be typically
smaller for the weights at the output layer and larger for the weights
leading to the units in the hidden layers. This will ensure that the
net change in the weights remains nearly the same for all layers.
Better convergence in learning can be achieved by adapting the
learning rate parameter suitably for each iteration. For this the
change in η is made proportional to the negative gradient of the
instantaneous error with respect to η [..., 1994, p. 195]. That is,
Δη_ij(m) = -γ ∂E(m)/∂η_ij
where γ is a proportionality constant. It was shown in [..., 1994] that this reduces to
Δη_ij(m) = γ [∂E(m)/∂w_ij(m)] [∂E(m-1)/∂w_ij(m-1)]
This is called the delta-delta learning rule [Jacobs, 1988]. The change
in the learning rate parameter depends on the instantaneous
gradients at the previous two iterations. In this learning it is difficult
to choose suitable values for the proportionality constant γ if the
magnitudes of the two gradients in the product are either too small
or too large. To overcome this limitation a modification of the above
learning rule, namely, delta-bar-delta learning rule was proposed
[Jacobs, 1988; ... and Williams, 1990].
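A sketch of the delta-bar-delta update in the form given by Jacobs: the learning rate grows additively while the current gradient agrees in sign with an exponentially weighted average of past gradients, and shrinks multiplicatively on a sign change. The parameter values κ, φ, θ below are illustrative:

```python
def delta_bar_delta(eta, delta_bar_prev, delta, kappa=0.01, phi=0.1, theta=0.7):
    """One delta-bar-delta step for a single weight's learning rate.
    delta is the current gradient dE/dw; delta_bar_prev is the running
    average of past gradients."""
    if delta_bar_prev * delta > 0:
        eta += kappa                 # consistent direction: speed up
    elif delta_bar_prev * delta < 0:
        eta *= (1.0 - phi)           # oscillation: slow down
    delta_bar = (1.0 - theta) * delta + theta * delta_bar_prev
    return eta, delta_bar

eta, dbar = 0.1, 0.0
eta, dbar = delta_bar_delta(eta, dbar, delta=0.5)   # dbar was 0: eta unchanged
assert abs(eta - 0.1) < 1e-12
eta, dbar = delta_bar_delta(eta, dbar, delta=0.5)   # same sign: eta grows
assert abs(eta - 0.11) < 1e-12
eta, dbar = delta_bar_delta(eta, dbar, delta=-0.5)  # sign flip: eta shrinks
assert abs(eta - 0.099) < 1e-12
```

The averaging makes the rule robust to a single noisy gradient, which is the limitation of plain delta-delta noted above.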
The adaptation of the learning rate parameter using the delta-
delta learning rule or the delta-bar-delta learning rule slows down
the backpropagation learning significantly due to additional
complexity in computation at each iteration. It is possible to reduce
this complexity by using the idea of the gradient reuse method, in
which the gradient estimate is obtained by averaging the gradient
values corresponding to several training patterns. Thus
Δw(m) = -η Σ_l g_l(m)
where l is the index for the training pattern and g_l is the propagated
error gradient. The learning rate parameter is computed using the
averaged gradient for several training patterns.
The values of the learning rate parameters computed using any
of the above methods are very low, thus resulting in slow learning.
One way to increase the rate of learning is by using a momentum
term in the weight change as follows [... et al., 1986; ..., 1989; ... et al.]:
Δw(m+1) = -η ∂E(m)/∂w(m) + α Δw(m)
where α (0 ≤ α < 1) is the momentum constant. The use of the momentum
term accelerates the descent to the minimum of the error surface. It
will also help in reducing the effects of local minima of the error surface.
The expression for the updated weight which includes the momentum
term as well as the learning rate adaptation is given by
w(m+1) = w(m) - η(m) ∂E(m)/∂w(m) + α Δw(m)    (4.108)
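A one-dimensional sketch of the momentum update; the values of η and α are illustrative:

```python
def momentum_step(w, delta_w_prev, gradient, eta=0.1, alpha=0.9):
    """Weight update with a momentum term:
    delta_w(m+1) = -eta * g(m) + alpha * delta_w(m)."""
    delta_w = -eta * gradient + alpha * delta_w_prev
    return w + delta_w, delta_w

# Minimize E(w) = 0.5 * w^2 (gradient = w) starting from w = 1.
w, dw = 1.0, 0.0
for _ in range(300):
    w, dw = momentum_step(w, dw, gradient=w)
assert abs(w) < 1e-3
```

Successive gradients of the same sign accumulate in delta_w, which is what accelerates the descent along gentle slopes.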
Normally the backpropagation learning uses the weight change
proportional to the negative gradient of the instantaneous error. Thus
it uses only the first derivative of the instantaneous error with
to the weight. If the weight change is made using the information in
the second derivative of the error, then a better estimate of the
optimum weight change towards the minimum may be obtained. The
momentum method is one such method where both the weight change
at the previous step and the gradient at the current step are used to
determine the weight change for the current step.
More effective methods [..., 1992] can be derived starting
with the following Taylor series expansion of the error as a function
of the weight:
E(w + Δw) = E(w) + g^T Δw + (1/2) Δw^T H Δw + higher order terms
where g = ∂E/∂w is the gradient vector, and H = ∂²E/∂w² is the Hessian
matrix. For small Δw, the higher order terms can be neglected, so
that we get
ΔE = E(w + Δw) - E(w) = g^T Δw + (1/2) Δw^T H Δw    (4.110)
Taking the derivative of E with respect to w gives the gradient
at the new point. That is,
g(w + Δw) = g + H Δw
On the other hand, taking the derivative of ΔE with respect to Δw
gives
∂(ΔE)/∂(Δw) = g + H Δw
Setting this to zero gives an optimum value of Δw, taking the
second order term into account. Therefore
Δw* = -H^(-1) g
Thus the new weight, taking the optimal value of Δw, is given by
w(m+1) = w(m) - H^(-1) g(m)
This is the Newton's method. Note that this is similar to the
expression (17) in Appendix C.
For the quadratic error function the optimal step Δw* will
lead to the final weight value w* starting from any initial weight
vector w(0). That is,
w* = w(0) - H^(-1) g(0)
provided H^(-1) is known at w = w(0). For a nonquadratic error
surface, as in the network with nonlinear units, the Newton's method
gives the optimal weight change if the variation of the error is
considered only up to the second derivative. Note that the Newton's
method is different from the gradient descent. Since the Newton's
method uses more information of the error surface than the gradient
descent, it is expected to converge faster. But there is no guarantee
that this choice of the weight change will converge.
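The Newton step Δw* = -H^(-1) g can be illustrated on a small quadratic, where a single step reaches the minimum exactly. A diagonal Hessian is assumed so that its inverse is trivial to form; the names are illustrative:

```python
def newton_step(w, grad, hess_inv):
    """One Newton step, w_new = w - H^(-1) g, for a diagonal Hessian
    given by its inverse diagonal entries."""
    return [wi - hi * gi for wi, hi, gi in zip(w, hess_inv, grad)]

# Quadratic error E(w) = w0^2 + 2*w1^2, minimum at (0, 0).
# Gradient g = (2*w0, 4*w1); Hessian H = diag(2, 4).
w = [3.0, -5.0]
g = [2.0 * w[0], 4.0 * w[1]]
w = newton_step(w, g, hess_inv=[0.5, 0.25])
assert w == [0.0, 0.0]   # a quadratic is minimized in one Newton step
```

For a nonquadratic error surface the step is only locally optimal, which is why convergence is not guaranteed, as noted above.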
Implementation of Newton's method is cumbersome due to the
need for computation of the Hessian matrix. Methods were proposed
which will avoid the need for the computation of the Hessian matrix.
The conjugate gradient method is one such method, where the
increment in the weight at the mth step is given by
Δw(m) = η(m) d(m)    (4.117)
where the direction d(m) of the increment in the weight is a linear
combination of the current gradient vector and the previous direction
of the increment in the weight. That is,
d(m) = -g(m) + β(m) d(m-1)
where the value of β(m) is obtained in terms of the gradients by one
of the following formulae [Fletcher and Reeves, 1964; Polak and
Ribiere, 1969]:
β(m) = g^T(m) g(m) / [g^T(m-1) g(m-1)]    (Fletcher-Reeves)
β(m) = g^T(m) [g(m) - g(m-1)] / [g^T(m-1) g(m-1)]    (Polak-Ribiere)
Computation of the learning rate parameter η(m) in Eq. (4.117)
requires line minimization for each iteration [Johansson et al., 1990].
The objective is to determine the value of η(m) for which the error
E(w(m) + η(m) d(m)) is minimized for given values of w(m) and d(m).
Performance of the conjugate-gradient method depends critically on
the choice of η(m), and hence on the line minimization. But generally
the conjugate-gradient method converges much faster than the
standard backpropagation learning, although there is no proof of
convergence in this case also due to the nonquadratic nature of the
error surface [... and Sangiovanni-Vincentelli, 1989].
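A pure-Python sketch of conjugate-gradient descent with the Fletcher-Reeves formula, on a diagonal quadratic error so that the line minimization has the closed form η = -g^T d / (d^T H d) (the names are illustrative):

```python
def cg_minimize(hess_diag, w, iters=2):
    """Conjugate-gradient descent on E(w) = 0.5 * sum(h_i * w_i^2),
    with exact line minimization and the Fletcher-Reeves formula
    beta = g(m)^T g(m) / g(m-1)^T g(m-1)."""
    grad = lambda w: [h * wi for h, wi in zip(hess_diag, w)]
    g = grad(w)
    d = [-gi for gi in g]               # first direction: steepest descent
    for _ in range(iters):
        # Exact line search on a quadratic: eta = -g^T d / (d^T H d).
        dHd = sum(h * di * di for h, di in zip(hess_diag, d))
        if dHd == 0.0:
            break
        eta = -sum(gi * di for gi, di in zip(g, d)) / dHd
        w = [wi + eta * di for wi, di in zip(w, d)]
        g_new = grad(w)
        beta = sum(gi * gi for gi in g_new) / sum(gi * gi for gi in g)
        d = [-gi + beta * di for gi, di in zip(g_new, d)]
        g = g_new
    return w

# On an n-variable quadratic, CG reaches the minimum in at most n steps.
w = cg_minimize([2.0, 4.0], [3.0, -5.0], iters=2)
assert all(abs(wi) < 1e-9 for wi in w)
```

On the nonquadratic error surface of a real network the line search must be done numerically, which is the cost referred to in the text.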
Refinements of the backpropagation learning: The backpropagation
learning is based on the steepest descent along the surface of the
instantaneous error in the weight space. It is only a first order
approximation of the descent as the weight change is assumed to be
proportional to the negative gradient. The instantaneous error is a
result of a single training pattern, which can be viewed as a sample
function of a random process. The search for the global minimum of
the error surface is stochastic in nature as it uses only the
instantaneous error at each step. The stochastic nature of the
gradient descent results in a zig-zag path of the trajectory in the
weight space in our search for the global minimum of the error
surface. Note that the zig-zag path is also due to the nonquadratic
nature of the error surface, which in turn is due to the nonlinear
output functions of the units. Note also that the backpropagation
learning is based only on the gradient descent and not on any
optimization criterion.
A better learning in terms of convergence towards the global
minimum may be achieved if the information from the given training
patterns are used more effectively. One such approach is based on
posing the supervised learning problem as a nonlinear system
identification problem [..., 1991]. The resulting learning
algorithm is called an extended Kalman-type learning algorithm [Singhal and
Wu, 1989], which uses a piecewise linear approximation to the nonlinear
optimal filtering problem.
Better learning can also be achieved if the supervised learning is
posed as an unconstrained optimization problem, where the cost
function is the error function [Battiti, 1992]. In this case the
optimal value of the increment in the weight is obtained by
considering terms up to the second order derivatives of the error function.
The resulting expression for the optimal Aw requires computation of
the second derivatives of the error with respect to all the weights, namely,
the Hessian matrix. The convergence will be faster than the gradient
descent, but there is no guarantee for convergence in this case also.
A multilayer neural network with backpropagation
learning on a finite set of independent and identically distributed
samples leads to an asymptotic approximation of the underlying a
posteriori class probabilities provided that the size of the training set
data is large, and the learning algorithm does not get stuck in a
local minimum [Hampshire and Pearlmutter, 1990].
If the a posteriori conditional probabilities are used as the desired
response in a learning algorithm based on an information theoretic
measure for the cost function [Kullback, 1968; ..., 1994],
then the network captures these conditional probability
distributions. In particular, the output of the network can be
interpreted as estimates of the a posteriori conditional probabilities
for the underlying distributions in the given training data.
Yet another way of formulating the learning problem for a
multilayer neural network is by using the fuzzy representation for
input or output or for both. This results in a fuzzy backpropagation
learning law [Ishibuchi et al, 19931. The convergence of the fuzzy
backpropagation learning is significantly faster, and the resulting
minimum mean squared error is also significantly lower than for the
usual backpropagation learning.
Interpretation of the result of learning: A trained multilayer feed-
forward neural network is expected to capture the functional
relationship between the input-output pattern pairs in the given
training data. It is implicitly assumed that the mapping function
corresponding to the data is a smooth one. But due to the limited number
of training samples, the problem becomes an ill-posed problem, in the
sense that there will be many solutions satisfying the given data, but
none of them may be the desired one [..., 1977;
Wieland and Leighton, 1987]. Figure 4.14 illustrates the basic
Figure 4.14 Illustration of an ill-posed problem for a function of one variable (training samples shown as points; the dotted curve shows the function realized due to overtraining).
idea of an ill-posed problem for a function of one variable. Given the
samples marked in the figure, the objective is to capture the function
represented by the solid curve. But depending on the size of the
network, several solutions are possible, including the overtraining
situations (shown by dotted curve) in which for all the training data
the error is zero. In fact there could be several functions passing
through the given set of points, none of which is the desired one. This
happens if the number of free parameters (weights) of the network
is very large. Such a situation results in a large error when some
other (test) samples are given to validate the network model for the
function. This is called 'poor generalization' by the network. On the
other hand, fewer number of the free parameters may result in a
large error even for the training data, and hence a poor approximation
to the desired function. The function approximation interpretation of
a multilayer neural network enables us to view different
hidden layers of the network performing different functions. For
example, the first hidden layer can be interpreted as capturing some
local features in the input space. The second hidden layer can be
interpreted as capturing some global features. This two-stage
approximation has been shown to realize any continuous
vector-valued function. The universal approximation
theorem of Cybenko seems to suggest that even a single hidden layer of
nonlinear units would suffice to realize any continuous function
[Cybenko, 1989]. But this result assumes that a hidden layer of
unlimited size is available, and that the continuous function to be
approximated is also available. Thus Cybenko's theorem gives only
an existence proof, but it is not useful to realize the function by
training a single hidden layer network.
A trained multilayer neural network can be interpreted as a
classifier, with complex decision surfaces separating the classes.
These decision surfaces are due to multiple layers of nonlinear units.
In the limiting case of hard-limiting nonlinear units, the geometrical
arguments for the creation of the complex decision surfaces in a
multilayer perceptron discussed in Section 4.3.3 are applicable.
It is also possible to view that the hidden layers perform a
nonlinear feature extraction to map the input data into linearly
separable classes in the feature space. At the output layer the unit
with the largest output is considered as the class to which the input
belongs.
As mentioned earlier, the output of a trained multilayer neural
network can also be considered as an approximation to the a
posteriori class probabilities.
Generalization: A backpropagation learning network is expected to
generalize from the training set data, so that the network can be
used to determine the output for a new test input. As mentioned
earlier, 'generalization' is different from 'interpolation', since in
generalization the network is expected to model the unknown system
or function from which the training set data has been obtained. The
problem of determination of weights from the training set data is
called the 'loading' problem [Judd, 1990; Blum and ..., 1992]. The
generalization performance depends on the size and efficiency of the
training set, besides the architecture of the network and the
complexity of the problem [Hush and Horne, 19931. Testing the
performance of the network with new data is called cross-validation.
If the performance for the test data is as good as for the training
data, then the network is said to have generalized from the training
data. Further discussion on generalization is given later in Section
7.3 and in Appendix D.
Tasks with backpropagation network: A backpropagation network
can be used for several applications such as realization of logic
functions, pattern classification, pattern mapping, function approxi-
mation, estimation of probability distribution and prediction [Hush
and Horne, 19931. These tasks were demonstrated in several real
world applications such as in speech, character recognition, system
identification, passive sonar, speech synthesis,
etc. [Sejnowski and Rosenberg, 1987; Cohen et al., 1993; ... et al.,
1990; Narendra and Parthasarathy, 1990; Casselman et al., 1991].
Limitations of backpropagation: The major limitation of the back-
propagation learning is its slow convergence. Moreover, there is no
proof of convergence, although it seems to perform well in practice.
Due to stochastic gradient descent on a nonlinear error surface, it is
likely that most of the time the result may converge to some local
minimum on the error surface [... and Tesi, 1992]. There is no easy
way to eliminate this effect completely, although stochastic learning
algorithms were proposed to reduce the effects of local minima
[..., 1988]. Another major problem is the problem of scaling.
When the complexity of the problem is increased, there is no
guarantee that a given network would converge, and even if it
converges, there is no guarantee that good generalization would
result. The complexity of a problem can be defined in terms of its
size or its predicate order [..., 1990; Hush and Horne,
1993]. Effects of scaling can be handled by using the prior information
of the problem, if possible. Modular architectures can also
reduce the effects of the scaling problem [Ballard, 1990; Jacobs et al.,
1991; ..., 1994].
For many applications, the desired output may not be known
precisely. In such a case the backpropagation learning cannot be used
directly. Other learning laws have been developed based on the
information whether the response is correct or wrong. This mode of
learning is called reinforcement learning or learning with critic
[Sutton et al., 1991; ..., 1992] as discussed in Section 2.4.6.
Extensions of backpropagation: Principles analogous to the ones
used in the backpropagation network have been applied to extend the
scope of the network in several directions as in the case of probabi-
listic neural networks, fuzzy backpropagation networks, regularization
networks and radial basis function networks [..., 1993].
4.5 Summary and Discussion
We have presented a detailed analysis of feedforward networks in
this chapter with emphasis on the pattern recognition tasks that can
be realized using these networks. A network with linear units
(Adaline units) performs a pattern association task provided the input
patterns are linearly independent. Linear independence of input
patterns also limits the number of patterns to the dimensionality of
the input pattern space. We have seen that this limitation is overcome
by using hard-limiting threshold units (perceptron units) in the
feedforward network. Since threshold units in the output layer result
in a discrete set of states, the resulting network performs a pattern
classification task. The hard-limiting threshold units provide a set of
inequalities to be satisfied by the network. Thus the weights of the
network are not unique any more and hence they are determined by
means of the perceptron learning law.
A single layer perceptron is limited to linearly separable classes
only. For an arbitrary pattern classification problem, a multilayer
perceptron is needed. But due to the absence of the desired output at
the units in the intermediate layers of units, the MLP network cannot
be trained by the simple perceptron learning law. This hard learning
problem can be solved by using nonlinear units with differentiable
output functions. Since the output functions are now continuous, the
multilayer feedforward neural network can perform a pattern mapping
task. The output error is used in the backpropagation learning
algorithm for these multilayer networks.
Since the backpropagation learning is based on stochastic
gradient descent along a rough error surface, there is no guarantee
that the learning law converges towards the desired solution for a
given pattern mapping task. Several variations of the back-
propagation learning have been suggested to improve the convergence
as well as the result of convergence. Although there is no proof of
convergence, the backpropagation learning algorithm seems to
perform effectively for many tasks such as pattern classification,
function approximation, time series prediction, etc.
How well a trained feedforward network performs a given task
is discussed both theoretically and experimentally in the literature
on generalization. The issue of generalization is an important topic,
but it is not discussed in this book. There are excellent treatments
of this topic in [..., 1997; Valiant, 1994]. Appendix D gives
an overview of generalization in neural networks.
Some of the limitations of backpropagation such as convergence
Chapter 5
Feedback Neural Networks
5.1 Introduction
This chapter presents a detailed analysis of the pattern recognition
tasks that can be performed by feedback artificial neural networks.
In its most general form a feedback network consists of a set of
processing units, where the output of each unit is fed as input to all
other units including the same unit. With each link connecting any
two units, a weight is associated which determines the amount of
output a unit feeds as input to the other unit. A general feedback
network does not have any structure, and hence is not likely to be
useful for solving any pattern recognition task.
However, by appropriate choice of the parameters of a feedback
network, it is possible to perform several pattern recognition tasks.
The simplest one is an autoassociation task, which can be performed
by a feedback network consisting of linear processing units. A detailed
analysis of the linear autoassociative network shows that the network
is severely limited in its capabilities. In particular, a linear
autoassociative network merely gives out what is given to it as input.
That is, if the input is noisy, it comes out as noisy output, thus giving
an error in recall even with optimal setting of weights. Therefore a
linear autoassociative network does not have any practical use. By
using a nonlinear output function for each processing unit, a feedback
network can be used for pattern storage. The function of a feedback
network with nonlinear units can be described in terms of the
trajectory of the state of the network with time. By associating an
energy with each state, the trajectory describes a traversal along the
energy landscape. The minima of the energy landscape correspond to
stable states, which can be used to store the given input patterns.
The number of patterns that can be stored in a given network depends
on the number of units and the strengths of the connecting links. It
is quite possible that the number of available energy minima is less
than the number of patterns to be stored. In such a case the given
pattern storage problem becomes a hard problem for the network. If
on the other hand, the number of energy minima in the energy
landscape of a network is greater than the required number of
patterns to be stored, then there is likely to be an error in the recall
of the stored patterns due to the additional false minima. The hard
problem can be solved by providing additional (hidden) units in a
feedback network, and the errors in recall of the stored patterns due
to false minima can be reduced using probabilistic update for the
output function of a unit. A feedback network with hidden units and
probabilistic update is called a Boltzmann machine. It can be used to
store a pattern environment, described by a set of patterns to be stored,
together with the probability of occurrence of each of these patterns.
Table 5.1 shows the organization of the topics to be discussed in
this chapter. A detailed analysis of linear autoassociative feedforward
networks is considered first in Section 5.2. The pattern storage problem
is analyzed in detail in Section 5.3. In particular, the energy
analysis, and the issues of hard problem and false minima are discussed
in this section. The Boltzmann machine is introduced in Section 5.4.
This section also deals with the details of the pattern environment
storage problem and the Boltzmann learning law. Some practical
issues in the implementation of learning laws for feedback networks
including simulated annealing are discussed in Section 5.5.
Table 5.1 Pattern Recognition Tasks by Feedback Neural Networks
Autoassociation
Architecture: Single layer with feedback, linear processing units
Learning: Not important
Recall: Activation dynamics until stable states are reached
Limitation: No accretive behaviour
To overcome: Nonlinear processing units, leads to a pattern storage problem

Pattern Storage
Architecture: Feedback neural network, nonlinear processing units, stable states, energy analysis
Learning: Not important
Recall: Activation dynamics until stable states are reached
Limitation: Hard problems, limited number of patterns, false minima
To overcome: Stochastic update, hidden units

Pattern Environment Storage
Architecture: Boltzmann machine, nonlinear processing units, hidden units, stochastic update
Learning: Boltzmann learning law, simulated annealing
Recall: Activation dynamics, simulated annealing
Limitation: Slow learning
To overcome: Different architecture
5.2 Analysis of Linear Autoassociative FF Networks
We consider the realization of an autoassociative task with a
feedforward network as shown in Figure 5.1.
Figure 5.1 Linear autoassociative network.
Analogous to the hetero-
association, in autoassociation the objective is to associate a given
pattern with itself during training, and then to recall the associated
pattern when an approximate version of the same pattern is
given during testing. In other words, in autoassociation the associated
output pattern is the same as the input pattern for each pattern.
That is, with reference to the pattern association task, b_l = a_l, l = 1,
2, ..., L, in the case of autoassociation. In recall it is desired to obtain
a_l as output for an approximate input a_l + n_l. The weight matrix
W = [w_ij] of a linear autoassociative network (Figure 5.1) can be
determined as in the case of the linear heteroassociator, for a fixed
set of input pattern vectors. Since we want WA = A, the optimal
weights are given by (see Section ...)
W = AA+    (5.1)
where A+ is the pseudoinverse of the M x L matrix A consisting of
the input vectors as columns. The pseudoinverse is given in terms of the
components of the singular value decomposition of the matrix A as follows:
A+ = Σ_i (1/√λ_i) q_i p_i^T
where λ_i are the eigenvalues, and p_i and q_i are the eigenvectors of
the matrices AA^T and A^T A, respectively. That is,
AA^T p_i = λ_i p_i and A^T A q_i = λ_i q_i
The sets of eigenvectors {p_i} and {q_i} are orthonormal sets. The
eigenvalues λ_i are real and nonnegative, since the matrices AA^T and
A^T A are symmetric. The eigenvalues are ordered as λ_1 ≥ λ_2 ≥ ... ≥ λ_L. If
r (≤ L) is the rank of the matrix A, then the eigenvalues λ_i, for i > r, will
be zero. Therefore
A+ = Σ_{i=1}^{r} (1/√λ_i) q_i p_i^T
The minimum error for the choice of the optimum weight matrix
is given from Eq. (4.16). But since b_l = a_l, for l = 1, 2, ..., L, this
minimum error is zero. Thus in the case of the linear
autoassociative network there is no error in the recall due to linear
dependency of the input patterns, unlike in the case of the linear
heteroassociative network. In other words, in this case the input
comes out as output without any error.
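A small pure-Python check of the autoassociation condition WA = A. For simplicity the input patterns are chosen orthonormal, so that the pseudoinverse reduces to the transpose, A+ = A^T; the general case needs the SVD-based pseudoinverse. The helper names are illustrative:

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Input patterns as columns of A (chosen orthonormal, so A+ = A^T
# and W = A A+ satisfies W A = A).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]          # M = 3 dimensional patterns, L = 2 of them

W = matmul(A, transpose(A))      # W = A A+, as in Eq. (5.1)

# Perfect recall of each stored pattern: W a = a.
recalled = matmul(W, A)
assert recalled == A
```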
When noise is added to the input vector, the noisy input vectors
are given by
c_l = a_l + n_l, l = 1, 2, ..., L
where the noise vector n_l is uncorrelated with the input vector a_l and
has the average power or variance σ². For the choice of W = AA+, the
error in recall is proportional to the noise power σ² [... and Aibara].
Thus the error in the recall is mainly due to noise, as the linear
dependence component of the error is zero in the case of auto-
association. Note that this is because a noisy input vector comes out
as a noisy output and hence its difference from the true vector in the
recall is only due to noise.
The error in recall due to noise can be reduced by the choice of
W = Σ_{i=1}^{k} p_i p_i^T, with k < r, which corresponds to keeping only the
first k terms in the expression for the pseudoinverse.
That is, for a given noise power σ², the error can be reduced by
moving the error into the linear dependency term, which is realized
by an appropriate choice of the number of terms (k) in the expression
for the pseudoinverse. The resulting error for the optimal choice of
k is lower than the error for the choice W = AA+.
The linear autoassociation task can also be realized by a single layer
feedback network with linear processing units shown in Figure 5.2. The
Figure 5.2 Linear autoassociation by a feedback network.
condition for autoassociation, namely Wa_l = a_l, is satisfied if
W = I, an identity matrix. This trivial choice of the weight matrix is
realized if the input vectors are linearly independent, so that
AA+ = I. For this choice of W, the output for a noisy input
a_l + n_l is given by W(a_l + n_l) = a_l + n_l, which is the noisy input itself. This
is due to the lack of accretive behaviour during recall, and such a
feedback network is not useful for storing information. It is possible
to make a feedback network useful, especially for pattern storage, if
the linear processing units are replaced with processing units having
nonlinear output functions. We discuss this case in the next section
and give a detailed analysis of pattern storage networks.
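As a quick numerical check of the above analysis, the following sketch (a minimal NumPy illustration; the random patterns and dimensions are arbitrary choices, not data from the text) builds the pseudoinverse-based weight matrix W = A A⁺ and confirms that stored patterns are recalled exactly, while a noisy input incurs an error due only to the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three linearly independent input patterns as the columns of A (M = 5).
A = rng.standard_normal((5, 3))

# Pseudoinverse-based weight matrix: W = A A+.
W = A @ np.linalg.pinv(A)

# Recall of any stored pattern is exact: W a_l = a_l.
print(np.allclose(W @ A, A))            # True

# For a noisy input a_l + n_l, W projects the input onto the span of
# the stored patterns, so the residual error is caused by noise alone.
a = A[:, 0]
noisy = a + 0.1 * rng.standard_normal(5)
print(np.linalg.norm(W @ noisy - a))    # small residual due to noise
```

Since W is an orthogonal projection, the residual norm can never exceed the norm of the added noise.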
5.3 Analysis of Pattern Storage Networks

5.3.1 Pattern Storage Networks
The objective in a pattern storage task is to store a given set of
patterns, so that any of them can be recalled exactly when an
approximate version of the corresponding pattern is presented to the
network. For this purpose, the features and their spatial relations in
the patterns need to be stored. The pattern recall should take place
even when the features and their spatial relations are slightly
disturbed due to noise and distortion or due to natural variation of
the pattern generating process. The approximation of a pattern refers
to the closeness of the features and their spatial relations to the
original stored pattern.
Sometimes the data itself is actually stored through the weights,
as in the case of binary patterns. In this case the approximation can
be measured in terms of some distance, like Hamming distance,
between the patterns. The distance is automatically captured by the
threshold feature of the output functions of the processing units in a
feedback network [Freeman and Skapura, 1991].
Pattern storage is generally accomplished by a feedback network
consisting of processing units with nonlinear output functions. The
outputs of the processing units at any instant of time define the
output state of the network at that instant. Likewise, the activation
values of the units at any instant determine the activation state of
the network at that instant.
The state of the network at successive instants of time, the
trajectory of the state, is determined by the activation dynamics
model used for the network. Recall of a stored pattern involves
starting at some initial state of the network depending on the input,
and applying the activation dynamics until the trajectory reaches an
equilibrium state. The final equilibrium state is the stored pattern
resulting from the network for the given input.
Associated with each output state is an energy (to be defined
later) which depends on the network parameters like the weights and
bias, besides the state of the network. The energy as a function of
the state of the network corresponds to something like an energy
landscape. The shape of the energy landscape is determined by the
network parameters and states. The feedback among the units and
the nonlinear processing in the units may create basins of attraction
in the energy landscape, when the weights satisfy certain constraints.
Figure 5.3 shows energy landscapes as a function of the output state
for the two cases, with and without basins of attraction. In the
latter case the energy fluctuates quickly and randomly from one state
to another, as shown in Figure 5.3b. But in the energy landscape with
basins of attraction, as in Figure 5.3a, the states around a stable
state correspond to small deviations from the stable state. The
deviation can be measured in some suitable distance measure, such
as Hamming distance for binary patterns. The Hamming distance
between two binary patterns each of length N is defined as the
number of bit positions in which the patterns differ. Thus the states
Figure 5.3 Energy landscapes (a) with basins of attraction and
(b) without basins of attraction.
closer to the stable states correspond to patterns with smaller
Hamming distance.
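The definition above can be stated as a one-line function (a trivial illustration, not taken from the text):

```python
# Hamming distance: the number of bit positions in which two
# equal-length binary patterns differ.
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

print(hamming([0, 1, 0], [1, 1, 1]))    # 2
```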
The basins of attraction in the energy landscape tend to be the
regions of stable equilibrium states [Cohen and Grossberg, 1983]. If
there is a fixed state in each of the basins where the energy is
minimum, then that state corresponds to a fixed point of equilibrium.
The basins could also be periodic (oscillatory) regions or chaotic
regions of equilibria. For an oscillatory region, the state of the
network changes continuously in a periodic manner. For a chaotic
region, the state of the network is not predictable, but it is confined
to the equilibrium region. Throughout the subsequent discussion we
consider only the fixed points of equilibrium in the energy landscape.
It is the existence of the basins of attraction or regions of
equilibrium states that is exploited for the pattern storage task. The
fixed points in these regions correspond to the states of the energy
minima, and they are used to store the desired patterns. These stored
patterns can be recalled even with approximate patterns as inputs.
An erroneous pattern is more likely to be closer to the corresponding
true pattern than to the other stored patterns according to some
distance measure. Each input pattern results in an initial state of
the network, which may be closer to the desired true state in the
sense that it may lie near the basin of attraction corresponding to
the true state. An arbitrary state may not correspond to an
equilibrium or a stable state. As the dynamics of the network evolves,
the network may eventually settle at a stable state, from which the
pattern may be read or derived.
For a network specified by the number of processing units, their
connection strengths and the activation dynamics, it is not normally
possible to determine exactly the number of basins of attraction in
the energy landscape as well as their relative spacings and depths
in the state space of the network. The spacing between two states
can be measured by a suitable distance measure, such as the Hamm-
ing distance for binary patterns. The number of patterns that can be
stored is called the capacity of the network. It is possible to estimate
the capacity of the network and also the average probability of error
in recall. The probability of error in recall can be reduced by adjusting
the weights in such a way that the resulting energy landscape is
matched to the probability distribution of the desired patterns.
Typically the capacity of a fully connected network is of the order
of N, the number of processing units. Although there are 2^N different
states for a network with N binary state units, the network can be used
to store only of the order of N binary patterns, as there will be only
that many fixed points or energy minima in the energy landscape.
In general, the number of desired patterns is independent of the
number of basins of attractions. The latter depends only on the
network units and their interconnections. If the number of patterns
is more than the number of basins of attraction, then the pattern
storage problem becomes a hard problem, in the sense that the
patterns cannot be stored in the given network. On the other hand,
if the number of patterns is less than the number of basins of
attraction, then there will be the so called false wells or minima due
to the additional basins of attraction. During recall, it is likely that
the state of the network, as it evolves from the initial state
corresponding to the input pattern, may settle in a false well. The
recalled pattern corresponding to the false well may not be the desired
pattern, thus resulting in an error in the recall.
In the next subsection we will consider the Hopfield model of a
feedback network for the pattern storage task, and discuss the working
of a discrete Hopfield model. The Hopfield model is a fully connected
feedback network with symmetric weights. In the discrete Hopfield
network the state update is asynchronous and the units have bipolar
output functions. In the continuous model the state update is dictated
by the activation dynamics, and the units have continuous nonlinear
output functions.
5.3.2 The Hopfield Model
Consider the McCulloch-Pitts neuron model for the units of a feedback
network, where the output of each unit is fed to all the other units
with weights w_ij, for all i and j. Let the output function of each of
the units be bipolar (±1), so that

    s_i = f(x_i - θ_i) = sgn(x_i - θ_i)

and

    x_i = Σ_j w_ij s_j

where θ_i is the threshold for the unit i. We will assume θ_i = 0 for
convenience. The state of each unit is either +1 or -1 at any given
instant of time. Due to feedback, the state of a unit depends on the
states of the other units. The updating of the state of a unit can be
done synchronously or asynchronously. In the synchronous update all
the units are simultaneously updated at each time instant, assuming
that the state of the network is frozen until update is made for all
the units. In the asynchronous update, a unit is selected at random
and its new state is computed. Another unit is selected at random
and its state is updated using the current state of the network. The
process of selecting a unit at random and updating its state is
continued until no further change in the state takes place for all
the units. That is, the state at time (t + 1) is the same as the
state at time t for all the units. That is,

    s_i(t + 1) = s_i(t), for all i    (5.15)
In this situation we can say that the network activation dynamics
has reached a stable state. We assume asynchronous update throughout
the following discussion. Note that the asynchronous update ensures
that the next state is at most unit Hamming distance from the current
state.
If the network is to store a pattern a = (a_1, a_2, ..., a_N)^T, then
in a stable state we must have the updated state value to be the same
as the current state value. That is,

    s_i = sgn(Σ_j w_ij s_j) = a_i, for all i

This can happen if the weights are chosen as

    w_ij = (1/N) a_i a_j

because

    sgn(Σ_j (1/N) a_i a_j a_j) = sgn(a_i) = a_i

where a_j² = 1 for bipolar states.
For storing L patterns, we could choose a general Hebbian rule given
by the sum of the Hebbian terms for each pattern. That is,

    w_ij = (1/N) Σ_{l=1}^{L} a_li a_lj

Then the state a_k will be stable if

    sgn((1/N) Σ_{j=1}^{N} Σ_{l=1}^{L} a_li a_lj a_kj) = a_ki, for all i    (5.19)

Taking out the l = k term in the summation and simplifying it using
a_kj² = 1, we get

    sgn(a_ki + (1/N) Σ_j Σ_{l≠k} a_li a_lj a_kj) = a_ki, for all i    (5.20)

which is true for all i, provided the crossterm in Eq. (5.20) does not
change the sign of a_ki plus the crossterm.
Table 5.2 gives an algorithm for storing and recall of patterns in a
Hopfield network.

Table 5.2 Hopfield Network Algorithm to Store and Recall a Set of Bipolar
Patterns

Let the network consist of N fully connected units, with each unit having a
hard-limiting bipolar threshold output function. Let a_l, l = 1, 2, ..., L, be the
vectors to be stored. The vectors a_l are assumed to have bipolar components,
a_li = ±1, i = 1, 2, ..., N.

1. Assign the connection weights

    w_ij = (1/N) Σ_{l=1}^{L} a_li a_lj, for i ≠ j
         = 0, for i = j, 1 ≤ i, j ≤ N

2. Initialize the network output with the given unknown input pattern a

    s_i(0) = a_i, for i = 1, 2, ..., N

where s_i(0) is the output of the unit i at time t = 0.

3. Iterate until convergence

    s_i(t + 1) = sgn(Σ_{j=1}^{N} w_ij s_j(t)), for i = 1, 2, ..., N

The process is repeated until the outputs remain unchanged with
further iteration. The steady outputs of the units represent the stored
pattern that best matches the given input.
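The algorithm of Table 5.2 can be sketched as follows (a minimal NumPy illustration; the example patterns and the serial sweep order used for the asynchronous updates are assumptions for the demonstration, not part of the table):

```python
import numpy as np

def store(patterns):
    """Step 1: Hebbian weights w_ij = (1/N) sum_l a_li a_lj, with w_ii = 0."""
    A = np.asarray(patterns, dtype=float)        # L x N matrix, entries +1/-1
    W = A.T @ A / A.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, a, max_sweeps=100):
    """Steps 2-3: asynchronous updates until a full pass changes no unit.

    Units are visited in a fixed serial order here for simplicity;
    random selection, as in the text, behaves the same at convergence.
    """
    s = np.asarray(a, dtype=float).copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            new = 1.0 if W[i] @ s >= 0 else -1.0
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            break
    return s

patterns = [[1, 1, -1, -1, 1, -1, 1, 1],
            [-1, 1, 1, -1, -1, 1, 1, -1]]
W = store(patterns)

noisy = np.array(patterns[0], dtype=float)
noisy[0] = -noisy[0]                             # flip one bit of pattern 1
print(recall(W, noisy))                          # recovers patterns[0]
```

With L = 2 and N = 8 the crossterm of Eq. (5.20) is small, so both stored patterns are stable and the corrupted input falls back into the correct basin of attraction.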
In general, the crossterm in Eq. (5.20) is negligible if L/N << 1.
Eq. (5.20) is satisfied if the number of patterns L is limited to the
storage capacity of the network, i.e., the maximum number of
patterns that can be stored in the network.
5.3.3 Capacity of the Hopfield Model

We consider the discrete Hopfield model to derive the capacity of the
network. Let us consider the following quantity [Hertz et al., 1991]:

    C_ki = - a_ki (1/N) Σ_j Σ_{l≠k} a_li a_lj a_kj    (5.21)

If C_ki is negative, then the crossterm and a_ki have the same sign in
Eq. (5.20), and hence the pattern is stable. On the other hand, if
C_ki is positive and greater than 1, then the sign of the crossterm
changes the sign of a_ki plus the crossterm in Eq. (5.20). The result
is that the pattern turns out to be unstable, and hence the desired
pattern cannot be stored.
The probability of error in the recall of a bit is then given by

    P_error = P(C_ki > 1)    (5.22)

To compute this probability, let us assume that the probability of
each component a_li being equal to +1 or -1 is 0.5, independently. For
random patterns, C_ki corresponds to 1/N times the sum of about NL
independent random numbers, each of which is +1 or -1. Thus C_ki is a
sum of random variables having a binomial distribution with zero mean
and variance

    σ² = L/N

If NL is assumed large, then the distribution of C_ki can be
approximated by a Gaussian distribution with zero mean and variance
L/N [Hertz et al., 1991]. Therefore,

    P_error = (1/2) [1 - erf(1/√(2σ²))] = (1/2) [1 - erf(√(N/(2L)))]

where erf(.) is the error function given by

    erf(x) = (2/√π) ∫_0^x e^(-u²) du

This gives a value of L/N = 0.105 for P_error = 0.001. Thus the
maximum number of patterns that can be stored for a probability of
error of 0.001 is L_max = 0.105 N.
A more sophisticated calculation [Amit et al., 1987]
using probabilistic update leads to a capacity of L_max = 0.138 N.
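The Gaussian estimate above can be checked by simulation. The following sketch (an illustration; the sizes N, L and the number of trials are arbitrary choices) measures the fraction of unstable bits over random bipolar patterns and compares it with the expression (1/2)[1 - erf(√(N/2L))]:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
N, L, trials = 100, 20, 100         # load L/N = 0.2 (illustrative sizes)

errors, total = 0, 0
for _ in range(trials):
    A = rng.choice([-1.0, 1.0], size=(L, N))
    W = A.T @ A / N
    np.fill_diagonal(W, 0.0)
    # Bit i of pattern k is in error if it would flip on a single check,
    # i.e. if a_ki and the net input sum_j w_ij a_kj have opposite signs.
    errors += int(np.sum(A * (A @ W) <= 0))
    total += L * N

p_emp = errors / total
p_th = 0.5 * (1.0 - erf(sqrt(N / (2.0 * L))))
print(p_emp, p_th)                  # the two values should roughly agree
```

At this load the single-bit error probability is on the order of one percent, consistent with the Gaussian approximation.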
5.3.4 Energy Analysis of the Hopfield Network

Discrete Hopfield model: Associated with each state of the network,
Hopfield proposed an energy function whose value always either
reduces or remains the same as the state of the network changes.
Assuming the threshold value of the unit i to be θ_i, the energy
function is given by [Hopfield, 1982]

    V(s) = -(1/2) Σ_i Σ_j w_ij s_i s_j + Σ_i θ_i s_i    (5.25)

The energy as a function of the state s of the network describes
the energy landscape in the state space. The energy landscape is
determined only by the network architecture, i.e., the number of
units, their output functions, threshold values, connections between
units and the strengths of the connections. Hopfield has shown that
for symmetric weights with no self-feedback, i.e., w_ij = w_ji and
w_ii = 0, and with bipolar (-1, +1) or binary (0, 1) output functions,
the dynamics of the
network using the asynchronous update always leads towards energy
minima at equilibrium. The states corresponding to these energy
minima turn out to be stable states, which means that small
perturbations around them lead to unstable states. Hence the dynamics
of the network takes the network back to a stable state again. It is
the existence of these stable states that enables us to store patterns,
one at each of these states.
To show that ΔV ≤ 0, let us consider the change of state due to the
update of one unit, say k, at some instant. All other units remain
unchanged. We can write the expressions for energy before and after
the change as [Freeman and Skapura, 1991]:

    V_old = -(1/2) Σ_i Σ_j w_ij s_i^old s_j^old + Σ_i θ_i s_i^old

    V_new = -(1/2) Σ_i Σ_j w_ij s_i^new s_j^new + Σ_i θ_i s_i^new

The change in energy due to the update of the kth unit is given by

    ΔV = V_new - V_old

Since s_i^new = s_i^old for i ≠ k, all the terms not involving the
unit k cancel. Hence,

    ΔV = -(1/2) (s_k^new - s_k^old) [Σ_{i≠k} w_ik s_i + Σ_{j≠k} w_kj s_j]
         + θ_k (s_k^new - s_k^old)

If the weights are assumed symmetric, w_ik = w_ki, then the terms in
both the parentheses are equal. If, in addition, w_kk = 0, we get

    ΔV = -(s_k^new - s_k^old) [Σ_{j≠k} w_kj s_j - θ_k]    (5.30)
The update rule for each unit k is as follows:

Case A: If Σ_{j≠k} w_kj s_j - θ_k > 0, then s_k^new = +1.

Case B: If Σ_{j≠k} w_kj s_j - θ_k < 0, then s_k^new = -1.

Case C: If Σ_{j≠k} w_kj s_j - θ_k = 0, then s_k^new = s_k^old.

For Case A, if s_k^old = +1, then ΔV = 0, and if s_k^old = -1, then
ΔV < 0. For Case B, if s_k^old = +1, then ΔV < 0, and if
s_k^old = -1, then ΔV = 0. For Case C, irrespective of the value of
s_k^old, ΔV = 0.

Thus we have ΔV ≤ 0. Therefore the energy decreases or remains
the same when a unit, selected at random, is updated, provided the
weights are symmetric and the self-feedback is zero. This is the
energy analysis for the discrete Hopfield model.
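The result ΔV ≤ 0 can be verified numerically. The sketch below (illustrative random symmetric weights with zero diagonal and, for simplicity, zero thresholds) tracks the energy of Eq. (5.25) during asynchronous updates and asserts that it never increases:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 12

# Illustrative random symmetric weights, zero diagonal, zero thresholds.
W = rng.standard_normal((N, N))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)

def energy(s):
    """Energy of Eq. (5.25) with all thresholds zero."""
    return -0.5 * s @ W @ s

s = rng.choice([-1.0, 1.0], size=N)
prev = energy(s)
for _ in range(500):
    k = rng.integers(N)
    s[k] = 1.0 if W[k] @ s >= 0 else -1.0       # asynchronous update
    e = energy(s)
    assert e <= prev + 1e-12                    # Delta V <= 0 at every step
    prev = e
print("final energy:", prev)
```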
That the expression for V in Eq. (5.25) does indeed represent
some form of energy can be seen from the following arguments based
on Hebb's law:

If a given pattern vector a_l is to be stored in the network state
vector s, then the match will be perfect if both the vectors coincide.
That is, the magnitude of their inner product is maximum.
Alternatively, the negative of the magnitude of their inner product
is minimum. Thus we can choose a quantity [Hertz et al., 1991]

    V_l = -(1/(2N)) (a_l^T s)²

to be minimized for storing a pattern vector a_l in the network. For
storing L pattern vectors we can write the resulting V as a summation
of the contributions due to each pattern vector. That is,

    V = -(1/(2N)) Σ_{l=1}^{L} (a_l^T s)²
      = -(1/2) Σ_i Σ_j [(1/N) Σ_{l=1}^{L} a_li a_lj] s_i s_j

If we identify the weights w_ij with the term (1/N) Σ_{l=1}^{L} a_li a_lj,
then we get

    V = -(1/2) Σ_i Σ_j w_ij s_i s_j

which is the same as the first term in the energy expression
given in Eq. (5.25).
This method of identifying the weights from an energy function is
useful, especially to solve several optimization problems. Given an
energy function or a cost function or an objective function for a
problem in terms of its variables and constraints, if we can identify
the coefficients associated with the terms s_i s_j, s_i and the
constant terms in the function, then a feedback network can be built
with weights corresponding to these coefficients. Then, using an
activation dynamics for the network, the equilibrium state or states
can be found. These states correspond to the minima or maxima of the
energy function. Higher order terms, consisting of products of three
or more variables, cannot be handled by the feedback model with
pairwise connections.
Continuous Hopfield model: In this subsection we will consider the
energy analysis for a continuous Hopfield model [Hopfield, 1984;
Hertz et al., 1991; Freeman and Skapura, 1991]. A continuous Hopfield
model is a fully connected feedback network with a continuous
nonlinear output function in each unit.

Figure 5.4 (a) Sigmoid function f(λx) = 1/(1 + e^(-λx)) for different
values of the gain parameter λ. (b) The inverse function.
(c) Contribution of the integral term to the energy function.

The output function is typically a sigmoid function,
f(λx) = 1/(1 + e^(-λx)), which is shown in Figure 5.4 for different
values of the gain parameter λ. In the continuous model all the units
will change their output signals continuously and simultaneously
towards values determined by the output function. The activation
values will also change continuously, with s_i = f(λx_i). This is
reflected in the following equation for the activation dynamics:

    τ (dx_i/dt) = - x_i + Σ_j w_ij s_j    (5.34)

where τ is the time constant and s_j = f(λx_j).

Consider the following energy function [Hopfield, 1984]:

    V = -(1/2) Σ_i Σ_j w_ij s_i s_j + (1/λ) Σ_i ∫_0^{s_i} f^(-1)(s) ds    (5.35)

We can show that in this case dV/dt ≤ 0. Differentiating V with
respect to time,

    dV/dt = Σ_i (∂V/∂s_i) (ds_i/dt)    (5.36)

Assuming symmetry of weights, w_ij = w_ji, and noting that
(1/λ) f^(-1)(s_i) = x_i, we get

    dV/dt = - Σ_i (ds_i/dt) [Σ_j w_ij s_j - x_i]    (5.37)

Using the relation in Eq. (5.34), we get

    dV/dt = - τ Σ_i (ds_i/dt)(dx_i/dt) = - τ Σ_i λ f'(λx_i) (dx_i/dt)²    (5.38)

Since f is a monotonically increasing function, f'(.) ≥ 0. Hence
dV/dt ≤ 0. Note that dV/dt = 0 when dx_i/dt = 0 for all i. This shows
that the activation dynamics eventually leads to a state where the
energy has a local minimum value, i.e., dV/dt = 0. This happens
when the activation state reaches an equilibrium steady state at
which there is no further change in the activation values,
dx_i/dt = 0. The above result, namely dV/dt ≤ 0, shows that the
energy always decreases as the state of the network changes.
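The result dV/dt ≤ 0 for the continuous model can likewise be checked by integrating Eq. (5.34) with a small Euler step. The sketch below uses s = tanh(λx) as a bipolar sigmoid (an assumption for illustration), for which the integral term of Eq. (5.35) has the closed form (1/λ)[s·atanh(s) + (1/2)ln(1 - s²)]; the network size, gain and step size are also illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, lam, tau, dt = 8, 1.0, 1.0, 0.01             # illustrative parameters

W = rng.standard_normal((N, N)) / np.sqrt(N)
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)

def energy(x):
    """Eq. (5.35) for s = tanh(lam * x); the integral term is
    (1/lam) * [s * atanh(s) + 0.5 * ln(1 - s^2)]."""
    s = np.tanh(lam * x)
    integral = (s * np.arctanh(s) + 0.5 * np.log(1.0 - s * s)) / lam
    return -0.5 * s @ W @ s + integral.sum()

x = 0.1 * rng.standard_normal(N)
V = [energy(x)]
for _ in range(2000):
    s = np.tanh(lam * x)
    x = x + dt * (-x + W @ s) / tau             # Euler step of Eq. (5.34)
    V.append(energy(x))

print(V[0], V[-1])                              # the energy only decreases
```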
Let us examine the differences between the continuous model and
the discrete model. In the discrete model only one unit is considered
at a time for update. The choice of the unit for update is random, and
the dynamics is that of the steady activation values (dx_i/dt = 0),
since the transients are assumed to have died down at each update of
the state of the unit. Hence in the discrete case dx_i/dt = 0 and
ΔV = 0 are different conditions. In the continuous case the states of
all the units, and hence the state of the network, change
continuously, as dictated by the differential equations for the
activation dynamics. Hence, in this case dx_i/dt = 0 for all i implies
that dV/dt = 0. The energy function V
is also called the Lyapunov function of the dynamical system.
The difference in the energy functions for the discrete and
continuous cases is due to the extra term in Eq. (5.35), namely

    (1/λ) Σ_i ∫_0^{s_i} f^(-1)(s) ds

For a gain value λ = 1 this term is Σ_i ∫_0^{s_i} f^(-1)(s) ds; for a
general gain value λ it is scaled by 1/λ. The integral becomes very
large as s_i approaches the limiting values of the output function
(see Figure 5.4c). But for high gain values (λ → ∞), this term in the
energy function becomes negligibly small, and hence the energy
function approaches that of the discrete case. In fact, when λ → ∞,
the output function becomes a bipolar function, and hence is equal to
the discrete case.
In the discrete case the energy minima are at some of the corners of
the hypercube in the N-dimensional space, since all the states are at
the corners of the hypercube. On the other hand, for moderate or
small values of λ, the integral term contributes large positive
values near the surfaces, edges and corners of the hypercube, and it
contributes small values in the interior of the hypercube. This is
because the magnitude of s_i is 1 at the surfaces, edges and corners.
Thus the energy minima will be displaced towards the interior of the
hypercube. As λ → 0, the minima of the energy function disappear one
by one, since all the states will tend to have the same energy value.
The energy analysis so far shows that, for symmetric weights on
the connections, there exist basins of attraction with a fixed point or
a stable point for each basin corresponding to an energy minimum.
If the connections are not symmetric, then the basins of attraction
may correspond to oscillatory or chaotic regions of states. In the
case of purely random connections, with mean zero and variance σ²,
there will be a transition from stable to chaotic behaviour as the
variance is increased [Sompolinsky et al., 1988; Hertz, 1995; Hertz
et al., 1991].
We can summarize the behaviour of feedback networks in relation
to the complexity of the network as follows: To make a network useful
for pattern storage, the output functions of the units are made
hard-limiting nonlinear. For analysis in terms of storage capacity,
as well as for the recall of information from the stable states, we
have imposed conditions of symmetry of weights and asyn-
chronous update. A more natural situation will be to use continuous
output functions, so that any type of pattern can be stored. But the
analysis of the performance of the network will be more difficult. In
addition, if we relax the conditions on the symmetry of weights, we may
get stable regions, but it is not possible to analyse the network in
terms of its storage capacity and retrieval of information. If we further
relax the constraints to make the feedback system closer to the
natural biological system, then we may be able to get better
functionality, but it is almost impossible to analyse such complex
networks. For example it is not possible to predict the global pattern
behaviour of a feedback network with random weights. Thus, although
the networks may get more and more powerful by relaxing the
constraints on the network, they become less useful, if we cannot predict
and control the pattern storage and recall of the desired information.
5.3.5 State Transition Diagram
Derivation of state transition diagram: The energy analysis of the
Hopfield network in the previous section shows that the energy of
the network at each state either decreases or remains the same as
the network dynamics evolves. In other words, the network either
remains in the same state or moves to a state having a lower energy.
This can also be demonstrated by means of a state transition diagram,
which gives the states of the network and their energies, together
with the probability of transition from one state to another. In this
section we will illustrate the state transition diagram (adapted from
[Aleksander and Morton, 1990]) for a 3-unit feedback network with
symmetric weights w_ij = w_ji. The units have threshold values θ_i,
i = 1, 2, 3, and a binary (0, 1) output function. A binary output
function is assumed for convenience, although the conclusions are
equally valid for the bipolar (-1, +1) case.
Figure 5.5 shows a 3-unit feedback network. The state update for
the unit i is governed by the following equation:

    s_i(t + 1) = f(Σ_j w_ij s_j(t) - θ_i)    (5.39)

Figure 5.5 A 3-unit feedback network with symmetric weights w_ij,
threshold values θ_i, and the output states s_i, i = 1, 2, 3.
The energy at any state of the network is given by

    V = -(1/2) Σ_i Σ_j w_ij s_i s_j + Σ_i θ_i s_i    (5.40)

There are eight different states for the 3-unit network, as each
of the s_i may assume a value of either 0 or 1. Thus the states are:
000, 001, 010, 100, 011, 101, 110 and 111. Assuming the weight and
threshold values given in Figure 5.5, the energy of each state can be
computed from Eq. (5.40); for example, V(000) = 0.0, with the other
values ranging from -0.2 to 0.7, as marked against the states in the
state transition diagram of Figure 5.6.
The transition from a state to the next state can be computed using
the state update Eq. (5.39). For example, if the current state is
000, then by selecting any one unit, say unit 2, at random, we can find
its next state by computing the activation value x_2 = Σ_j w_2j s_j and
comparing it with the threshold θ_2. Since (x_2 = 0) > (θ_2 = -0.2), the
state of the unit 2 changes from 0 to 1. Thus if we update this unit,
there will be a transition from the state 000 to 010. Since we can
select any one of the three units with equal probability, 1/3, the
probability of making a transition from 000 to 010 is thus 1/3.
Likewise, by selecting the unit 1 for update, the network makes a
transition from 000 to 100 with a probability 1/3. Selecting the unit
3 for update results in a transition from 000 to itself, since the
activation (x_3 = 0) < (θ_3 = 0.7). By computing the transition probabili-
ties for all the states, we get the state transition diagram shown in
Figure 5.6. Note that while computing the transitions, only asynchro-
nous update of each unit selected at random was used. Table 5.3
shows the computation of the state transitions by comparing the
weighted inputs with the threshold value for each unit. The entries
in the parentheses are the weighted input and the threshold for each
unit.
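The construction of such a state transition diagram can be sketched as follows. The weights and thresholds below are illustrative values only (the full set of values in Figure 5.5 is not recoverable from the text); each unit is selected with probability 1/3 and updated by the binary rule of Eq. (5.39):

```python
from itertools import product

# Illustrative parameters (NOT the values of Figure 5.5):
# symmetric weights, zero self-feedback.
W = [[0.0, -0.5, 0.4],
     [-0.5, 0.0, 0.3],
     [0.4, 0.3, 0.0]]
theta = [-0.2, -0.3, 0.6]

def next_state(s, i):
    """Binary threshold update of unit i, Eq. (5.39): f(x) = 1 iff x >= 0."""
    x = sum(W[i][j] * s[j] for j in range(3))
    t = list(s)
    t[i] = 1 if x - theta[i] >= 0 else 0
    return tuple(t)

# Each unit is selected with probability 1/3, so each distinct successor
# of a state receives a multiple of 1/3 as its transition probability.
for s in product((0, 1), repeat=3):
    succ = {}
    for i in range(3):
        t = next_state(s, i)
        succ[t] = succ.get(t, 0) + 1
    label = "".join(map(str, s))
    arrows = ", ".join("".join(map(str, t)) + f" (p={c}/3)" for t, c in succ.items())
    print(label, "->", arrows)
```

For these particular parameters, the state 111 has a self-transition probability of 1, i.e. it is a stable state.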
From the state transition diagram we observe the following points:
The diagram is drawn in such a way that the higher energy states are
shown above the lower energy states. The transition is always from a
higher energy state to a state with equal or lower energy. Thus the
result ΔV ≤ 0 is satisfied. There are some states (010, 100
and 111) which have a self-transition probability of 1. That means, once
these states are reached, the network remains in these states, which is
equivalent to saying that the activation dynamics equation is such that
    s_i = f(Σ_j w_ij s_j - θ_i), for all i

where f is the binary output function, i.e., f(x) = 0 for x < 0 and
f(x) = 1 for x ≥ 0.

Figure 5.6 (a) A 3-unit network and (b) the corresponding state
transition diagram. (Adapted from [Aleksander and Morton, 1990].)

Since there is no transition from these states to
other states, these are stable states. Note that only three out of the
total eight states are stable states. As per the approximate capacity
calculations made in Section 5.3.3 for a Hopfield network, the number
Table 5.3 Computation of State Transitions for Figure 5.6
Unit 1 Unit 2 Unit 3
of stable states will be much fewer than the number of possible
states, and in fact the number of stable states is of the order of N.
The stable states are always at the energy minima, so that the
transition to any of these states is always from a state with a higher
energy value than the energy value of the stable state.
Computation of weights for pattern storage: So far we have consi-
dered the analysis of a given feedback network and studied its
characteristics. But patterns can also be stored at the stable states by
design. That is, it is possible to determine the weights of a network
by calculation in order to store a given set of patterns in the network.
Let 010 and 111 be the two patterns to be stored in a 3-unit binary
network. Then at each of these states the following activation
dynamics equations must be satisfied:

    s_i = f(Σ_j w_ij s_j - θ_i), for i = 1, 2, 3

This will result in the following inequalities for each of the states.
For the state s = 010:

    w_12 < θ_1,  w_22 ≥ θ_2,  w_32 < θ_3

and for the state s = 111:

    w_11 + w_12 + w_13 ≥ θ_1,  w_21 + w_22 + w_23 ≥ θ_2,
    w_31 + w_32 + w_33 ≥ θ_3
Since we assume symmetry of the weights, w_ij = w_ji, and no
self-feedback, w_ii = 0, the above inequalities reduce to

    w_12 < θ_1,  θ_2 ≤ 0,  w_23 < θ_3

    w_12 + w_13 ≥ θ_1,  w_12 + w_23 ≥ θ_2,  w_13 + w_23 ≥ θ_3

A choice of thresholds and weights satisfying the above inequalities
gives the network shown in Figure 5.7, which stores the given two
patterns. These two patterns correspond to the stable states 010 and
111 in the network, as can be seen from the state transition diagram
in Figure 5.7, where the energy values for the different states are
also shown.
Figure 5.7 (a) A 3-unit network and (b) the corresponding state
transition diagram. (Adapted from [Aleksander and Morton, 1990].)
Table 5.4 shows the computation of the state transitions by comp-
aring the weighted inputs with the threshold values for each unit in
each state. The entries in the parentheses are the weighted input and
the threshold for each unit.
Table 5.4 Computation of State Transitions for Figure 5.7

Unit 1 Unit 2 Unit 3
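The design procedure can be checked in code. The parameter values below are one illustrative choice satisfying the inequalities derived above (not necessarily the values of Figure 5.7); enumerating all eight states confirms that exactly 010 and 111 are stable:

```python
from itertools import product

# One illustrative choice (not necessarily the book's Figure 5.7 values)
# satisfying the inequalities: w12 < th1, th2 <= 0, w23 < th3, and
# w12 + w13 >= th1, w12 + w23 >= th2, w13 + w23 >= th3.
W = [[0.0, 0.2, 0.5],
     [0.2, 0.0, 0.2],
     [0.5, 0.2, 0.0]]
theta = [0.5, -0.1, 0.5]

def stable(s):
    """A state is stable if no unit changes under the update of Eq. (5.39)."""
    for i in range(3):
        x = sum(W[i][j] * s[j] for j in range(3))
        if (1 if x - theta[i] >= 0 else 0) != s[i]:
            return False
    return True

print([s for s in product((0, 1), repeat=3) if stable(s)])
# -> [(0, 1, 0), (1, 1, 1)]
```

For this particular choice there happen to be no additional stable states, i.e. no false minima.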
5.3.6 Pattern Storage by Computation of Weights: Problem of False
Energy Minima
For another choice of the values of θ_i and w_ij which satisfies all the
inequalities for the above problem of storage of the patterns 010 and
111 in a 3-unit network, there may be more than two energy minima
or stable states in the network. Two of them correspond to the desired
patterns and the other extra states correspond to false minima. This
is illustrated for the choice of the thresholds and weights shown in
Figure 5.8a. The corresponding state transition diagram is given in
Figure 5.8b. Here there are three energy minima, corresponding
to the three stable states 010, 100 and 111.
The presence of the extra stable state may result in recalling a
pattern not in the set of the desired patterns to be stored. If an
approximate input is given to the units in the network, so that the
network is forced into the state, say s = 000, initially, then since
this state is unstable, the dynamics of the network will eventually
lead to either the state 010 (the desired pattern) or to the state 100
(see the state transition diagram in Figure 5.8b). Both these states
are stable states and have equal probability of transition from the
initial state 000. While the state 010 is the desirable pattern to be
recalled, there is an equal chance that the pattern 100 may be
recalled. Likewise, if the initial state is 110, then there is an equal
chance that any one of the stable states 010, 100 and 111 may be
recalled. The recall of the pattern 111 results in an undetectable
error, as the desired pattern is 010 for the approximate input 110.
The recall of the pattern 100 at any time will give as output a pattern
which was not stored in the network intentionally, since in our
pattern storage task we have specified only 010 and 111 as the
desired patterns to be stored in the network. The stable state 100 in
this case corresponds to a false (undesirable) energy minimum.
Errors in recall due to false minima can be reduced in two ways:
1. By designing the energy minima for the given patterns in an
optimal way, so that the given patterns correspond to the
lowest energy minima in the network.
2. By using a stochastic update of the state for each unit, instead
of the deterministic update dictated by the activation values
and the output function.
The issue of stochastic update will be discussed in Section 5.4, and the
issue of designing energy wells by learning in Section 5.5.
Pattern storage hard problems: In the previous subsection we
have discussed the problem of having more minima in the energy
landscape than the number required to store the given patterns. In
this section we consider the case of the so called hard problems of
pattern storage. Let us consider the problem of storing the patterns,
say, 000, 011, 101 and 110. By using the condition
s_i = f(Σ_j w_ij s_j - θ_i) for each unit i, the inequalities to be satisfied to
make these states stable in a 3-unit feedback network can be derived.
In this case no choice of thresholds and weights can satisfy all the
constraints in the inequalities. The reason is that the number of desired
patterns is more than the capacity of the network, and hence they cannot
be stored
in a feedback network with 3 units. In some cases,
even if the number of desired patterns is within the capacity limit of a
network, the patterns may not be representable in a given type
(binary) of a feedback network. For example, for storing the patterns
000 and 100, the following inequalities have to be satisfied by the type
of network we have been considering so far:

    θ_1 > 0,  θ_2 > 0,  θ_3 > 0, and θ_1 ≤ 0    (5.43)

The conditions on θ_1 cannot obviously be satisfied
simultaneously by any choice of θ_1. In fact any pair of patterns within
a Hamming distance of 1 cannot be stored in a 3-unit network.
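The contradiction can also be confirmed by brute force: θ_1 > 0 (for 000 to be stable) directly contradicts θ_1 ≤ 0 (for 100 to be stable), so a random search over symmetric weights and thresholds (an illustrative experiment) never finds a network in which both states are stable:

```python
import random

random.seed(0)

def stable(s, W, theta):
    """True if no unit of the binary network changes state under Eq. (5.39)."""
    for i in range(3):
        x = sum(W[i][j] * s[j] for j in range(3))
        if (1 if x - theta[i] >= 0 else 0) != s[i]:
            return False
    return True

found = False
for _ in range(20000):
    # Random symmetric weights with zero diagonal and random thresholds.
    w12, w13, w23 = (random.uniform(-1, 1) for _ in range(3))
    W = [[0, w12, w13], [w12, 0, w23], [w13, w23, 0]]
    theta = [random.uniform(-1, 1) for _ in range(3)]
    if stable((0, 0, 0), W, theta) and stable((1, 0, 0), W, theta):
        found = True
        break

print("both patterns storable:", found)   # always False: theta_1 cannot
                                          # be both > 0 and <= 0
```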
Pattern storage problems which cannot be represented by a
feedback network of a given size can be called hard problems. This
is analogous to the hard problems in the pattern classification task
for a single layer perceptron network. Hard problems in the pattern
storage task are handled by introducing additional units in the
feedback network. These units are called hidden units. But with
hidden units it is difficult to write a set of inequalities as before to
make the given patterns correspond to stable states in the feedback
network. Thus the design of a network with hidden units becomes
difficult due to lack of a straightforward approach for determining
the weights of the network. In other words, these may be viewed as
hard learning problems. We will see in Section 5.5 how this problem
is addressed by means of a suitable learning law. To store a given
number of patterns, a network with a large number of units may have
to be considered. But in general it is difficult to know the required
number of units exactly for a given number of patterns to be stored.
5.4 Stochastic Networks and Simulated Annealing

5.4.1 Stochastic Update
Error in pattern recall due to false minima can be reduced
significantly if initially the desired patterns are stored (by careful
training) at the lowest energy minima of a network. The error can
be reduced further by using suitable activation dynamics. Let us
assume that by training we have achieved a set of weights which will
enable the desired patterns to be stored at the lowest energy minima.
The activation dynamics is modified so that the network can also
move to a state of higher energy value initially, and then to the
nearest deep energy minimum. This way errors in recall due to false
minima can be reduced.
It is possible to realize a transition to a higher energy state from
a lower energy state by using a stochastic update in each unit instead
of the deterministic update of the output function as in the Hopfield
model. In a stochastic update the activation value of a unit does not
directly decide the next output state of the unit through the output
function. Instead, the update is expressed in probabilistic terms, like
the probability of firing by the unit being greater than 0.5 if the
activation value exceeds a threshold, and less than 0.5 if the
activation value is less than the threshold. Note that the output
function is still a nonlinear function, either a hard-limiting
threshold logic function or a semilinear sigmoidal function, but the
function itself is applied in a stochastic manner.
Figure 5.9 shows a typical probability function that can be used
for stochastic update of units. The output function itself is the
binary logic function.
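A common way to realize such a stochastic update (an assumption for illustration; the exact probability function used in the text is introduced later) is to make the firing probability a sigmoid of the net input, with a temperature-like parameter T controlling the randomness:

```python
import math
import random

random.seed(1)

def stochastic_update(x, theta, T):
    """Fire (output 1) with probability p = 1 / (1 + exp(-(x - theta) / T)).

    The sigmoid form and the parameter T are assumptions for
    illustration. Note that p > 0.5 exactly when the activation x
    exceeds the threshold, as described in the text, and as T -> 0
    the unit reduces to the deterministic threshold unit.
    """
    p = 1.0 / (1.0 + math.exp(-(x - theta) / T))
    return 1 if random.random() < p else 0

# A unit whose activation exceeds its threshold fires most of the time,
# but not always.
fires = sum(stochastic_update(0.4, 0.1, T=0.2) for _ in range(1000))
print(fires, "firings out of 1000")
```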